Abstract: In a noncooperative dynamic game, multiple agents
operating in a changing environment aim to optimize their
utilities over an infinite time horizon. Time-varying environments
allow modeling of more realistic scenarios (e.g., mobile devices
equipped with batteries, wireless communications over a fading
channel, etc.). However, solving a dynamic game is a difficult
task that requires dealing with multiple coupled optimal control
problems. We focus our analysis on a class of problems, named
dynamic potential games, whose solution can be found through
a single multivariate optimal control problem. Our analysis
generalizes previous studies by considering that the set of
environment's states and the set of players' actions are constrained,
as required by most applications. We also show that
the theoretical results are the natural extension of the analysis
for static potential games. We apply the analysis and provide
numerical methods to solve four key example problems, each with
different features: i) energy demand control in a smart-grid
network, ii) network flow optimization in which the relays
have bounded link capacity and limited battery life, iii) uplink
multiple access communication with users that have to optimize
the use of their batteries, and iv) two optimal scheduling games
with nonstationary channels.

Index Terms: Dynamic games, dynamic programming, game
theory, multiple access, network flow, optimal control, resource
allocation, scheduling, smart grid.


I. Introduction 

Game theory is a field of mathematics that studies conflict
and cooperation between intelligent decision makers
[?]. It has become a useful tool for modeling communication
and networking problems, such as power control and resource
sharing (see, e.g., [?]), wherein the strategies followed by
the users (i.e., players) influence each other, and the actions
have to be taken in a decentralized manner. However, one
main assumption of classic game theory is that the users
operate in a static environment, which is not influenced
by the players' actions. This assumption is unrealistic in
many communication and networking problems. For instance,
wireless devices have to maximize throughput while facing
time-varying fading channels, and mobile devices may have
to control their transmitter power while saving their battery
level. These time-varying scenarios can be better modeled by
dynamic games.

In a noncooperative dynamic game, the players compete in
a time-varying environment, which we assume can be characterized
by a deterministic discrete-time dynamical system
equipped with a set of states and a Markovian state-transition
equation. Each player has its utility function, which depends
on the current state of the system and the players' current
actions. Both the state and action sets are subject to constraints.
Since the state transitions induce a notion of time evolution
in the game, we consider the general case wherein utilities,
state-transition function and constraints can be nonstationary.
A dynamic game starts at an initial state. Then, the players
take some action, based on the current state of the game, and
receive some utility values. Then, the game moves to another
state. This sequence of state transitions is repeated at every
time step over a (possibly) infinite time horizon. We consider
the case in which the aim of each player is to find the sequence
of actions that maximizes its long-term cumulative utility,
given the other players' sequence of actions. Thus, a game can
be represented as a set of coupled optimal control problems
(OCP), which are difficult to solve in general. Fortunately,
there is a class of dynamic games, named dynamic potential
games (DPG), that can be solved through a single multivariate
optimal control problem (MOCP). The benefit of DPG is that
solving a single MOCP is generally simpler than solving a set
of coupled OCP (see [?] for a recent survey on DPG).

The pioneering work in the field of DPG is that of [?],
later extended by [?] and [?]. There have been two main
approaches to study DPG: the Euler-Lagrange equations and
Pontryagin's maximum (or minimum) principle. Recent
analysis by [?] and [?] used the Euler-Lagrange equations with DPG
in its reduced form, that is, when it is possible to isolate the
action from the state-transition equation, so that the action is
expressed as a function of the current and future (i.e., after
transition) states. Consider, for example, that the future state
is linear in the current action; then, it is easy to invert the
state-transition function and rewrite the problem in reduced
form, with the action expressed as a function of the current
and future states. However, in many cases, it is not possible to
find such a reduced form of the game (i.e., we cannot isolate the
action) because the state-transition function is not invertible
(e.g., when the state-transition function is quadratic in the
action variable). The more general case of DPG in nonreduced
form was studied with the Pontryagin's maximum principle
approach by [?] and [?] for discrete and continuous time
models, respectively. However, in all these studies [?],
the games have been analyzed without explicitly considering
constraints for the state and action sets.

Other works that consider potential games with state
dynamics include [?]. However, these references study
the myopic problem in which the agents aim to maximize
their immediate reward. This is different from DPG, where
the agents aim to maximize their long-term utility by solving
a control problem.

Dynamic games offer two kinds of possible analysis based
on the type of control that players use. These cases are
normally referred to as open loop (OL) and closed loop (CL)
game analysis. In the open loop approach, in order to find the
optimal action sequence, the players have to take into account
the other players' action sequences. On the other hand, in a closed
loop approach, players find a strategy that is a function of the
state, i.e., it is a mapping from states to actions. Thus, in order
to find their optimal policies, they need to know the form
of the other players' policy functions. The OL analysis is, in
general, more tractable than the CL analysis. Indeed,
there are only a few known CL solutions for simple games,
such as the fish war example presented in [?], oligopolistic
Cournot games or quadratic games [14].

The main theoretical contribution of this work is to analyze
DPG with constrained action and state sets, as required
by most applications (e.g., in a network flow problem, the
aggregated throughput of multiple users is bounded by the
maximum link capacity; or in cognitive radio, the aggregated
power of all secondary users is bounded by the maximum
interference allowed by the primary users). To do so, we
apply the Euler-Lagrange equation to the Lagrangian (as is
customary in the MOCP literature [15]), rather than to the
utility function (as done by earlier works [?] and [?]). Using
the Lagrangian, we can formulate the optimality condition in
the general nonreduced form (i.e., it is not necessary to isolate
the action in the transition equation). In addition, we establish
the existence of a suitable conservative vector field as an easily
verifiable condition for a dynamic game to be of the potential
type. To the best of our knowledge, this is a novel extension
of the conditions established for static games by [?] and [?].

The second main contribution of this work is to show that
the proposed framework can be applied to several communication
and networking problems in a unified manner. We
present four examples with increasing complexity level. First,
we model the energy demand control in a smart grid network
as a linear-quadratic dynamic game (LQDG). This scenario
is illustrative because the analytical solution of an LQDG
is known. The second example is an optimal network flow
problem, in which there are two levels of relay nodes equipped
with finite batteries. The users aim to maximize their flow
while optimizing the use of the nodes' batteries. This problem
illustrates that, when the utilities have some separable form,
it is straightforward to establish that the problem is a DPG.
However, the analytical solution for this problem is unknown
and we have to solve it numerically. It turns out that, since
all batteries will deplete eventually, the game will get stuck in
this depletion state. Hence, we can approximate the infinite-horizon
MOCP by an effective finite-horizon problem, which
simplifies the numerical computation. The third example is
an uplink multiple access channel wherein the users' devices
are also equipped with batteries (this example was introduced
in the preliminary paper [18]). Again, the simple (but more
realistic) extension of battery-usage optimization makes the
game dynamic. In this example, instead of rewriting the utilities
in a separable form, we perform a very general analysis
to establish that the problem is a DPG. The fourth example
studies two decentralized scheduling problems: proportional
fair and equal rate scheduling, where multiple users share a
time-varying channel (see the preliminary paper [?]). This
example shows how to use the proposed framework in its
most general form. The problems are nonconcave and the
utilities have a nonobvious separable form. The problem is
nonstationary, with a state-transition equation that changes with
time. Moreover, there is no reason that justifies a finite-horizon
approximation of the problem, so we have to use optimal
control methods (e.g., dynamic programming) to solve it
numerically.

Outline: Sec. II introduces the problem setting, its solution
and the assumptions on which we base our analysis. In Sec. III
we review static potential games together with the instrumental
notion of conservative vector field. In Sec. IV we provide
sufficient conditions for a dynamic game with constrained state
and action sets to be a DPG, and show that a DPG can be
solved through an equivalent MOCP. Sections V-VIII deal
with application examples, the methods for solving them, and
some illustrative simulations. We provide some conclusions in
Sec. IX.


II. Problem Setting

Let $\mathcal{Q} = \{1, \ldots, Q\}$ denote the set of players and let
$\mathcal{X} \subseteq \mathbb{R}^S$ denote the set of states of the game. Note that
the dimensionality of the state set can be different from the
number of players (i.e., $S \neq Q$). At every time step $t$, the state
vector of the game is represented by $x_t = (x_t^k)_{k=1}^{S} \in \mathcal{X}$.

Every player $i \in \mathcal{Q}$ can be influenced only by a subset
of states $\mathcal{X}^i \subseteq \mathcal{X}$. The partition of the state space $\mathcal{X}$
among players is done in the component domain. We define
$\mathcal{X}(i) \subseteq \{1, \ldots, S\}$ as the subset of indexes of state-vector
components that influence player $i$; then $x_t^i = (x_t^m)_{m \in \mathcal{X}(i)}$
indicates the value of the state vector for player $i$ at time $t$.
This generality allows for games in which multiple players are
affected by common components of the state vector (e.g., when
they share a common resource), and includes the particular
case wherein they share no components. We also define
$x_t^{-i} = (x_t^l)_{l \notin \mathcal{X}(i)} \in \mathcal{X}^{-i}$ as the vector of components that
do not influence player $i$, for some subset $\mathcal{X}^{-i} \subseteq \mathcal{X}$.

Let $\mathcal{U} \subseteq \mathbb{R}^Q$ denote the set of actions of all players, and
let $\mathcal{U}^i \subseteq \mathbb{R}$ stand for the subset of actions of player $i$, such
that $\mathcal{U} = \prod_{i=1}^{Q} \mathcal{U}^i$. The extension to higher dimensional action
sets is straightforward (i.e., when $\mathcal{U}^i \subseteq \mathbb{R}^{A^i}$), but we restrict
to scalar actions in order to simplify notation (the general case
will be introduced when necessary for some of the application
examples). We write $u_t^i \in \mathcal{U}^i$ for the action variable of player
$i$ at time $t$, such that the vector $u_t = (u_t^1, \ldots, u_t^Q) \in \mathcal{U}$
contains the actions of all players. We also define
$u_t^{-i} = (u_t^1, \ldots, u_t^{i-1}, u_t^{i+1}, \ldots, u_t^Q) \in \mathcal{U}^{-i} = \prod_{j \neq i} \mathcal{U}^j$ as the
vector of all players' actions except that of player $i$. Hence,
by slightly abusing notation, we can rewrite $u_t = (u_t^i, u_t^{-i})$.

The state transitions are determined by $f : \mathcal{X} \times \mathcal{U} \times \mathbb{N} \to \mathcal{X}$,
such that the nonstationary Markovian dynamic equation of
the game is $x_{t+1} = f(x_t, u_t, t)$, which can be split among
components: $x_{t+1}^k = f^k(x_t, u_t, t)$ for $k = 1, \ldots, S$, such
that $f = (f^k)_{k=1}^{S}$. The dynamic is Markovian because the
state transition to $x_{t+1}$ depends on the current state-action
pair $(x_t, u_t)$, rather than on the whole history of state-action
pairs $\{(x_0, u_0), \ldots, (x_t, u_t)\}$. We remark that $f$ is given
in nonreduced form; that is, we do not assume that there exists a
function $\varphi$ such that $u_t = \varphi(x_t, x_{t+1}, t)$.

We include a vector of $C$ nonstationary constraints $g = (g^c)_{c=1}^{C}$,
as required by most applications, and define the
sets $\mathcal{C}_t = \{\mathcal{X} \times \mathcal{U}\} \cap \{(x_t, u_t) : g(x_t, u_t, t) \le 0\} \cap \{(x_t, u_t) : x_{t+1} = f(x_t, u_t, t)\}$.

Each player has its nonstationary utility function
$\pi^i : \mathcal{X}^i \times \mathcal{U} \times \mathbb{N} \to \mathbb{R}$, such that, at every time $t$, each player
receives a utility value equal to $\pi^i(x_t^i, u_t^i, u_t^{-i}, t)$. The aim of
player $i$ is to find the sequence of actions $\{u_0^i, \ldots, u_t^i, \ldots\}$ that
maximizes its long-term cumulative utility, given the other players'
sequence of actions $\{u_0^{-i}, \ldots, u_t^{-i}, \ldots\}$. Thus, a discrete-time
infinite-horizon noncooperative nonstationary Markovian
dynamic game can be represented as a set of $Q$ coupled
optimal control problems:

$$
\mathcal{G}_1 : \ \forall i \in \mathcal{Q} \quad
\begin{array}{cl}
\underset{\{u_t^i\} \in \prod_{t=0}^{\infty} \mathcal{U}^i}{\text{maximize}} & \displaystyle\sum_{t=0}^{\infty} \beta^t \pi^i(x_t^i, u_t^i, u_t^{-i}, t) \\
\text{s.t.} & x_{t+1} = f(x_t, u_t, t), \quad x_0 \ \text{given} \\
 & g(x_t, u_t, t) \le 0
\end{array}
\tag{1}
$$

where $0 < \beta < 1$ is the discount factor that bounds the
cumulative utility (for simplicity, we define the same $\beta$ for
every player). Note that, since the players can share state-vector
components, the constraints may affect every player's
feasible region. Problem (1) is infinite-horizon because the
reward is accumulated over infinitely many time steps.

The solution concept of problem (1) in which we are
interested is the Nash Equilibrium (NE) of the game, which
is defined as follows.

Definition 1. A solution of problem (1), known as a Nash
Equilibrium (NE), is a feasible sequence of actions $\{u_t^{*}\}_{t=0}^{\infty}$
that satisfies the following condition for every player $i \in \mathcal{Q}$:

$$
\sum_{t=0}^{\infty} \beta^t \pi^i(x_t^i, u_t^{i*}, u_t^{-i*}, t) \ge \sum_{t=0}^{\infty} \beta^t \pi^i(x_t^i, u_t^{i}, u_t^{-i*}, t), \quad \forall (x_t, u_t) \in \mathcal{C}_t
\tag{2}
$$
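To make the objects in problem (1) concrete, the following minimal Python sketch rolls out a state-transition equation and evaluates one player's discounted cumulative utility over a truncated horizon. The toy game (two players draining a shared battery), the fixed policy, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import math

def simulate(f, policy, x0, horizon):
    """Roll out x_{t+1} = f(x_t, u_t, t) for `horizon` steps."""
    xs, us = [x0], []
    for t in range(horizon):
        u = policy(xs[-1], t)
        us.append(u)
        xs.append(f(xs[-1], u, t))
    return xs, us

def discounted_utility(pi_i, xs, us, beta):
    """Truncated version of sum_t beta^t * pi_i(x_t, u_t, t)."""
    return sum(beta**t * pi_i(xs[t], u, t) for t, u in enumerate(us))

# Hypothetical toy game: both players drain one shared battery x.
def f(x, u, t):
    return x - u[0] - u[1]

def policy(x, t):
    # Each player spends a quarter of the remaining charge (assumed rule).
    return (0.25 * x, 0.25 * x)

def pi_0(x, u, t):
    # Assumed stage utility of player 0: concave in its own action.
    return math.log(1.0 + u[0])

xs, us = simulate(f, policy, x0=1.0, horizon=50)
J0 = discounted_utility(pi_0, xs, us, beta=0.9)
# Constraint g <= 0 of the toy game: total consumption cannot exceed the charge.
feasible = all(u[0] + u[1] <= x + 1e-12 for x, u in zip(xs, us))
```

Truncating the sum is only an approximation of the infinite-horizon objective; with $0 < \beta < 1$ and bounded stage utilities the neglected tail shrinks geometrically with the horizon.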


We consider the following assumptions: 

Assumption 1. The utilities $\pi^i$ are twice continuously differentiable
in $\mathcal{X} \times \mathcal{U}$.

Assumption 2. The state and action spaces, $\mathcal{X}$ and $\mathcal{U}$, are
open and convex subsets of a real vector space.

Assumption 3. The state-transition function $f$ and the constraints
$g$ are continuously differentiable in $\mathcal{X} \times \mathcal{U}$ and satisfy
some regularity conditions.

In general, finding an NE of problem (1) is a difficult
task because the utilities, dynamic equation and constraints
of the individual optimal control problems (OCP) are coupled
among players. However, when problem (1) is a DPG, we
can solve it through an equivalent MOCP, as opposed to
a set of coupled univariate OCP. We use Assumptions 1
and 2 to obtain a verifiable condition for problem (1) to be
a DPG. Assumption 3 is required to introduce the conditions
that guarantee equivalence between the solution of the
MOCP and an NE of the original DPG. In particular, since
we derive the KKT optimality conditions for both problems
(namely the DPG and the MOCP), some regularity conditions
(such as Slater's, the linear independence of gradients or the
Mangasarian-Fromovitz constraint qualifications) are required
to ensure that the KKT conditions hold at the optimal points
and that feasible dual variables exist (see, e.g., [20, Sec. 3.3.5],
[21]). Finally, we introduce one further assumption in Sec. IV
to ensure existence of a solution to the MOCP and, hence,
existence of an NE of the DPG.

This equivalence between DPG and MOCP generalizes the
well studied but simpler case of static potential games [?],
[?], which is reviewed in the following section.


III. Overview of Static Potential Games 


Static games are a simplified version of dynamic games in
the sense that there are neither states, nor system dynamics.
The aim of each player $i$, given the other players' actions $u^{-i}$, is
to choose an action $u^i \in \mathcal{U}^i$ that maximizes its utility function:

$$
\mathcal{G}_2 : \ \forall i \in \mathcal{Q} \quad
\begin{array}{cl}
\underset{u^i \in \mathcal{U}^i}{\text{maximize}} & \pi^i(u^i, u^{-i}) \\
\text{s.t.} & g(u) \le 0
\end{array}
\tag{3}
$$

where (similar to dynamic games but removing the time-dependence
subscript) $u^i \in \mathcal{U}^i$ refers to the action of player
$i$; and $u^{-i} \in \mathcal{U}^{-i}$ is the set of actions of the rest of
agents, such that $u = (u^i, u^{-i}) \in \mathcal{U}$ denotes the set of all
players' actions. We assume $\mathcal{U} \subseteq \mathbb{R}^Q$ to be open and convex.

In general, finding or even characterizing the set of equilibrium
points (e.g., in terms of existence or uniqueness) of
problem (3) is difficult. Fortunately, there are particular cases
of this problem for which the analysis is greatly simplified.
Potential games are one of these cases.


Definition 2. Let Assumptions 1-2 hold. Then, problem (3) is
called a static potential game if there is a function $\Pi : \mathcal{U} \to \mathbb{R}$,
named the potential, that satisfies the following condition for
every player:

$$
\pi^i(u^i, u^{-i}) - \pi^i(v^i, u^{-i}) = \Pi(u^i, u^{-i}) - \Pi(v^i, u^{-i}), \quad \forall u^i, v^i \in \mathcal{U}^i, \ \forall i \in \mathcal{Q}
\tag{4}
$$

Under Assumptions 1-2, it can be shown (see, e.g., [?,
Lemma 4.4]) that a necessary and sufficient condition for a
static game to be potential is the following:

$$
\frac{\partial \pi^i(u)}{\partial u^i} = \frac{\partial \Pi(u)}{\partial u^i}, \quad \forall i \in \mathcal{Q}
\tag{5}
$$


We can gain insight on potential games by relating (5) to
the concept of conservative vector field. The following lemma
will be useful to this end.


Lemma 1. Let $F(u) = (F_1(u), \ldots, F_Q(u))$ be a vector field
with continuous derivatives defined over an open convex set
$\mathcal{U} \subseteq \mathbb{R}^Q$. The following conditions on $F$ are equivalent:

1) There exists a scalar potential function $\Pi(u)$ such that
$F(u) = \nabla \Pi(u)$, where $\nabla$ is the gradient.

2) The partial derivatives satisfy

$$
\frac{\partial F_i(u)}{\partial u^j} = \frac{\partial F_j(u)}{\partial u^i}, \quad \forall u \in \mathcal{U}, \ i, j = 1, \ldots, Q
\tag{6}
$$

3) Let $a$ be a fixed point of $\mathcal{U}$. For any piecewise smooth
path $\xi$ joining $a$ with $u$, we have $\Pi(u) = \int_{a}^{u} F(\xi) \cdot d\xi$.

A vector field satisfying these conditions is called conservative.

Proof: See, e.g., [22, Theorems 10.4, 10.5 and 10.9].


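Let us define a vector field whose components are the partial
derivatives of the players' utilities:

$$
F(u) = \left( \frac{\partial \pi^1(u)}{\partial u^1}, \ldots, \frac{\partial \pi^Q(u)}{\partial u^Q} \right)
\tag{7}
$$

Suppose that (7) can be written more compactly as $F(u) = \nabla \Pi(u)$, so that
Lemma 1.1 holds. Then, we have that
$\frac{\partial \pi^i(u)}{\partial u^i} = \frac{\partial \Pi(u)}{\partial u^i}$, $\forall i \in \mathcal{Q}$.
Note that this is exactly the condition given by (5). It follows
from Lemma 1.2 that a necessary, sufficient and also easily
verifiable condition for problem (3) to be a static potential
game is given by:

$$
\frac{\partial^2 \pi^i(u)}{\partial u^i \partial u^j} = \frac{\partial^2 \pi^j(u)}{\partial u^j \partial u^i}, \quad \forall i, j \in \mathcal{Q}
\tag{8}
$$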
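Condition (8) can be checked numerically when the utilities are available only as black-box functions. The sketch below estimates the cross partial derivatives by central finite differences and tests the symmetry for two toy games. The Cournot-style utilities, the parameter values and the helper names are assumptions chosen for illustration, not examples from the paper.

```python
def cross_partial(pi, u, i, j, h=1e-4):
    """Central finite-difference estimate of d^2 pi / (du^i du^j) at u."""
    def shifted(di, dj):
        v = list(u)
        v[i] += di
        v[j] += dj
        return pi(v)
    return (shifted(h, h) - shifted(h, -h)
            - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)

def is_potential_game(utilities, u, tol=1e-5):
    """Test the symmetry condition (8) at the point u for every pair i < j."""
    Q = len(utilities)
    return all(
        abs(cross_partial(utilities[i], u, i, j)
            - cross_partial(utilities[j], u, j, i)) < tol
        for i in range(Q) for j in range(i + 1, Q)
    )

def make_cournot(i, a=10.0, b=1.0, c=1.0):
    # Assumed Cournot utility: pi^i(u) = u^i (a - b * sum(u)) - c u^i.
    def pi(u):
        return u[i] * (a - b * sum(u)) - c * u[i]
    return pi

cournot = [make_cournot(i) for i in range(3)]
# A two-player zero-sum bilinear game, which violates (8).
zero_sum = [lambda u: u[0] * u[1], lambda u: -u[0] * u[1]]

cournot_ok = is_potential_game(cournot, [1.0, 2.0, 3.0])
zero_sum_ok = is_potential_game(zero_sum, [1.0, 2.0])
```

A single test point is enough here because both toy games are quadratic, so their second derivatives are constant; for general utilities the check would have to be repeated over a sample of points in $\mathcal{U}$ and can only provide evidence, not a proof.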


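Finally, Lemma 1.3 is useful since we can find the potential
function $\Pi$ by solving the line integral of the field:

$$
\Pi(u) = \int_{0}^{1} \sum_{i=1}^{Q} \frac{\partial \pi^i(\xi(\lambda))}{\partial u^i} \frac{d \xi^i(\lambda)}{d \lambda} \, d\lambda
\tag{9}
$$

where $\xi(\lambda) = (\xi^i(\lambda))_{i=1}^{Q}$ is a piecewise smooth path in $\mathcal{U}$ that
connects the initial and final conditions: $\xi(0) = a$, $\xi(1) = u$.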
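As an illustration of the line integral (9), the following sketch recovers the potential of a toy Cournot game by integrating the field of own-action derivatives along the straight path $\xi(\lambda) = a + \lambda (u - a)$ with the midpoint rule. The game, its parameters and the closed-form expression used for comparison are assumptions for this example; the recovered potential is normalized so that $\Pi(a) = 0$.

```python
def potential_by_line_integral(grads, u, a=None, steps=200):
    """Midpoint-rule approximation of (9) along xi(lam) = a + lam * (u - a).

    grads[i](xi) must return d pi^i / d u^i evaluated at xi.
    """
    Q = len(u)
    a = [0.0] * Q if a is None else a
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) / steps
        xi = [a[i] + lam * (u[i] - a[i]) for i in range(Q)]
        # d xi^i / d lam = u^i - a^i for the straight path.
        total += sum(grads[i](xi) * (u[i] - a[i]) for i in range(Q)) / steps
    return total

A, B, C = 10.0, 1.0, 1.0  # assumed Cournot parameters

def make_grad(i):
    # Derivative of pi^i(u) = u^i (A - B sum(u)) - C u^i with respect to u^i.
    return lambda u: A - C - B * sum(u) - B * u[i]

def closed_form_potential(u):
    # Known potential of this toy game, with Pi(0) = 0.
    pairs = sum(u[i] * u[j] for i in range(len(u)) for j in range(i + 1, len(u)))
    return (A - C) * sum(u) - B * sum(x * x for x in u) - B * pairs

u = [1.0, 2.0, 3.0]
numeric = potential_by_line_integral([make_grad(i) for i in range(3)], u)
exact = closed_form_potential(u)
```

Because the field is conservative, any other piecewise smooth path between the same end points would give the same value; the straight path is simply the most convenient choice.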

Once we have found $\Pi$, it can be seen [?] that necessary
conditions for $u^*$ to be an equilibrium of the game (3) are also
necessary conditions for the following optimization problem:

$$
\mathcal{P}_1 : \quad
\begin{array}{cl}
\underset{u \in \mathcal{U}}{\text{maximize}} & \Pi(u) \\
\text{s.t.} & g(u) \le 0
\end{array}
\tag{10}
$$

Indeed, optimization theorems concerning existence and convergence
can now be applied to game (3). In particular,
reference [?] showed that the local maxima of the potential
function are a subset of the NE of the game. Furthermore,
in the case that all players' utilities are quasi-concave, the
maximum is unique and coincides with the stable equilibrium
of the game.
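The link between maximizing the potential and finding an equilibrium can be sketched on the same kind of toy Cournot game: plain gradient ascent on $\Pi$ (ignoring the constraint $g(u) \le 0$, which is assumed inactive here) reaches a point from which no player gains by deviating unilaterally. The game, step size and deviation grid are assumptions made for illustration.

```python
A, B, C, Q = 10.0, 1.0, 1.0, 3  # assumed Cournot parameters and player count

def utility(i, u):
    # pi^i(u) = u^i (A - B sum(u)) - C u^i
    return u[i] * (A - B * sum(u)) - C * u[i]

def potential_grad(u):
    # By (5), component i of grad Pi equals d pi^i / d u^i.
    s = sum(u)
    return [A - C - B * s - B * ui for ui in u]

def gradient_ascent(grad, u0, step=0.1, iters=2000):
    u = list(u0)
    for _ in range(iters):
        u = [ui + step * gi for ui, gi in zip(u, grad(u))]
    return u

def is_nash(u, deviations, tol=1e-9):
    """Check that no player improves by switching to any listed deviation."""
    for i in range(len(u)):
        base = utility(i, u)
        for d in deviations:
            v = list(u)
            v[i] = d
            if utility(i, v) > base + tol:
                return False
    return True

u_star = gradient_ascent(potential_grad, [0.0] * Q)
grid = [0.05 * k for k in range(201)]  # candidate deviations in [0, 10]
nash = is_nash(u_star, grid)
```

For this concave quadratic potential the maximizer is unique, $u^i = (A - C) / (B (Q + 1))$, so the ascent converges to the single equilibrium; in nonconcave games the same procedure would only find a local maximum of the potential.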

This same approach can be extended to dynamic games. 
Nevertheless, instead of obtaining an analogous optimization 
problem, DPG will yield an analogous MOCP. 


IV. Dynamic Potential Games with Constraints 
This section introduces the main theoretical contribution of
the paper: we establish conditions under which we can find an
NE of problem (1) by solving an alternative MOCP, instead
of having to solve the set of coupled infinite-horizon OCP
with coupled constraints. First, we introduce the definition of
a DPG and show conditions for problem (1) to belong to this
class. Then, we introduce the alternative MOCP and prove that
its solution is an NE of the game.

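Definition 3. Problem (1) is called a DPG if there is a function
$\Pi : \mathcal{X} \times \mathcal{U} \times \mathbb{N} \to \mathbb{R}$, named the potential, that satisfies the
following condition for every player $i \in \mathcal{Q}$:

$$
\sum_{t=0}^{\infty} \beta^t \left( \pi^i(x_t^i, u_t^i, u_t^{-i}, t) - \pi^i(x_t^i, v_t^i, u_t^{-i}, t) \right)
= \sum_{t=0}^{\infty} \beta^t \left( \Pi(x_t, u_t^i, u_t^{-i}, t) - \Pi(x_t, v_t^i, u_t^{-i}, t) \right),
\quad \forall x_t \in \mathcal{X}, \ \forall u_t^i, v_t^i \in \mathcal{U}^i
\tag{11}
$$

Note that, although the potential function $\Pi$ is defined for
the larger set $\mathcal{X} \times \mathcal{U} \times \mathbb{N}$, the local objective $\pi^i$ is only defined
over its local subset $\mathcal{X}^i \times \mathcal{U} \times \mathbb{N}$. Therefore, we only have
to check whether condition (11) is satisfied in each player's
subset.

The following three lemmas give conditions under which
problem (1) is a DPG (i.e., it satisfies Definition 3).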

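Lemma 2. Problem (1) is a DPG if there exists some function
$\Pi(x_t, u_t, t)$ that satisfies

$$
\frac{\partial \pi^i(x_t^i, u_t, t)}{\partial x_t^m} = \frac{\partial \Pi(x_t, u_t, t)}{\partial x_t^m}, \qquad
\frac{\partial \pi^i(x_t^i, u_t, t)}{\partial u_t^i} = \frac{\partial \Pi(x_t, u_t, t)}{\partial u_t^i},
\quad \forall m \in \mathcal{X}(i), \ \forall i \in \mathcal{Q}, \ t = 0, \ldots, \infty
\tag{12}
$$

Proof: We simply extend to dynamic games the argument
for static games due to [?, Prop. 1]. From (12) and Assumption 1
we have:

$$
\frac{\partial}{\partial x_t^m} \left( \Pi(x_t, u_t, t) - \pi^i(x_t^i, u_t, t) \right) = 0, \quad \forall m \in \mathcal{X}(i)
\tag{13}
$$

$$
\frac{\partial}{\partial u_t^i} \left( \Pi(x_t, u_t, t) - \pi^i(x_t^i, u_t, t) \right) = 0
\tag{14}
$$

This means that the difference between the potential and each
player's utility depends neither on $x_t^i$ nor on $u_t^i$. Thus, we can
express this difference as

$$
\Pi(x_t, u_t^i, u_t^{-i}, t) - \pi^i(x_t^i, u_t^i, u_t^{-i}, t) = \Theta^i(x_t^{-i}, u_t^{-i}, t)
\tag{15}
$$

for some function $\Theta^i : \mathcal{X}^{-i} \times \mathcal{U}^{-i} \times \mathbb{N} \to \mathbb{R}$. Since (15) is
satisfied for every $u_t^i \in \mathcal{U}^i$, we can subtract two versions of
(15) with actions $u_t^i$ and $v_t^i$ in $\mathcal{U}^i$. Then, by arranging terms
and summing over all $t$, we obtain (11). ■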

Condition (12) is usually difficult to check in practice
because we do not know $\Pi$ beforehand. Fortunately, there
are cases in which the players' utilities have some separable
structure that allows us to easily deduce that the game is of
the potential type, as explained in the following lemma.

Lemma 3. Problem (1) is a DPG if the utility function of
every player $i \in \mathcal{Q}$ can be expressed as the sum of a term
that is common to all players plus another term that depends
neither on its own action, nor on its own state-components:

$$
\pi^i(x_t^i, u_t^i, u_t^{-i}, t) = \Pi(x_t, u_t^i, u_t^{-i}, t) + \Theta^i(x_t^{-i}, u_t^{-i}, t)
\tag{16}
$$

Proof: By taking the partial derivatives of (16) we obtain
(12). Therefore, we can apply Lemma 2 (see also [?, Prop.
1]). ■
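The separable structure of Lemma 3 can be checked numerically at the level of a single stage: when a utility is a common term plus a term that ignores the player's own action and state, a unilateral change of action moves the utility and the common term by exactly the same amount. The functions below are hypothetical choices for a two-player game with one shared scalar state (so the set of state components that do not influence a player is empty).

```python
import math
import random

def potential(x, u):
    # Hypothetical common term Pi(x, u); any smooth function would do.
    return math.log(1.0 + x + u[0] + u[1])

def theta(i, u):
    # Extra term of player i: depends only on the other player's action.
    return 3.0 * u[1] ** 2 if i == 0 else -u[0]

def utility(i, x, u):
    # Separable structure: pi^i = Pi + Theta^i.
    return potential(x, u) + theta(i, u)

def deviation_gap(i, x, u, v_i):
    """|(pi^i(u) - pi^i(v)) - (Pi(u) - Pi(v))| when player i switches to v_i."""
    v = list(u)
    v[i] = v_i
    lhs = utility(i, x, u) - utility(i, x, v)
    rhs = potential(x, u) - potential(x, v)
    return abs(lhs - rhs)

random.seed(0)
max_gap = max(
    deviation_gap(i, random.random(),
                  [random.random(), random.random()], random.random())
    for i in (0, 1) for _ in range(100)
)
```

The gap is zero up to floating-point rounding at every sampled point, which is the per-stage version of the identity that is discounted and summed over time in the definition of a DPG.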
However, posing the utility in the separable structure (16)
may be difficult. We need a more general framework that
allows us to check whether problem (1) is a DPG when the
players' utilities have a nonobvious separable structure. This
framework is formally introduced in the following lemma.


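Lemma 4. Problem (1) is a DPG if all players' utilities satisfy
the following conditions, $\forall i, j \in \mathcal{Q}$, $\forall m \in \mathcal{X}(i)$, $\forall n \in \mathcal{X}(j)$:

$$
\frac{\partial^2 \pi^i(x_t^i, u_t, t)}{\partial x_t^m \partial u_t^j} = \frac{\partial^2 \pi^j(x_t^j, u_t, t)}{\partial x_t^n \partial u_t^i}
\tag{17}
$$

$$
\frac{\partial^2 \pi^i(x_t^i, u_t, t)}{\partial x_t^m \partial x_t^n} = \frac{\partial^2 \pi^j(x_t^j, u_t, t)}{\partial x_t^n \partial x_t^m}
\tag{18}
$$

$$
\frac{\partial^2 \pi^i(x_t^i, u_t, t)}{\partial u_t^i \partial u_t^j} = \frac{\partial^2 \pi^j(x_t^j, u_t, t)}{\partial u_t^j \partial u_t^i}
\tag{19}
$$

Proof: Under Assumption 1 we can introduce the following
vector field:

$$
F = \left( \nabla_{x_t^1} \pi^1(x_t^1, u_t, t)^{\top}, \ldots, \nabla_{x_t^Q} \pi^Q(x_t^Q, u_t, t)^{\top},
\frac{\partial \pi^1(x_t^1, u_t, t)}{\partial u_t^1}, \ldots, \frac{\partial \pi^Q(x_t^Q, u_t, t)}{\partial u_t^Q} \right)
\tag{20}
$$

where $\nabla_{x_t^i} \pi^i(x_t^i, u_t, t) = \left( \frac{\partial \pi^i(x_t^i, u_t, t)}{\partial x_t^m} \right)_{m \in \mathcal{X}(i)}$. From Lemma
2 we can express (20) as

$$
F = \nabla \Pi(x_t, u_t, t)
\tag{21}
$$

From Assumption 2 and Lemma 1.1, we know that $F$ is
conservative. Hence, Lemma 1.2 establishes that the second
partial derivatives must satisfy (17)-(19). ■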

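Introduce the following MOCP:

$$
\mathcal{P}_2 : \quad
\begin{array}{cl}
\underset{\{u_t\} \in \prod_{t=0}^{\infty} \mathcal{U}}{\text{maximize}} & \displaystyle\sum_{t=0}^{\infty} \beta^t \Pi(x_t, u_t, t) \\
\text{s.t.} & x_{t+1} = f(x_t, u_t, t), \quad x_0 \ \text{given} \\
 & g(x_t, u_t, t) \le 0
\end{array}
\tag{22}
$$

Let us consider the following assumption, which is needed for
establishing equivalence between a DPG and the MOCP (22).

Assumption 4. The MOCP (22) has a nonempty solution set.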


Sufficient (and easily verifiable) conditions to satisfy Assumption 4
are given by the following lemma, which is a
standard result in optimal control theory.

Lemma 5. Let $\Pi : \mathcal{X} \times \mathcal{U} \times \mathbb{N} \to [-\infty, \infty)$ be a proper
continuous function, and let any one of the following conditions
hold for $t = 1, \ldots, \infty$:

1) The constraint sets $\mathcal{C}_t$ are bounded.

2) $\Pi(x_t, u_t, t) \to -\infty$ as $\|(x_t, u_t)\| \to \infty$ (coercive).

3) There exists a scalar $M$ such that the level sets, defined
by $\{(x_t, u_t, t) \mid \Pi(x_t, u_t, t) \ge M\}$, are nonempty and
bounded.

Then, $\forall x_0 \in \mathcal{X}$, there exists an optimal sequence of actions
$\{u_t^*\}_{t=0}^{\infty}$ that is a solution to the MOCP (22). Moreover, there
exists an optimal policy $\phi^* : \mathcal{X} \times \mathbb{N} \to \mathcal{U}$, which is a mapping
from states to optimal actions, such that when applied over
the state trajectory $\{x_t\}_{t=0}^{\infty}$, it provides an optimal sequence
of actions $\{u_t^* = \phi^*(x_t, t)\}_{t=0}^{\infty}$.

Proof: Since $\Pi$ is proper, it has some nonempty level
set. Since $\Pi$ is continuous, its bounded level sets are compact.
Hence, we can use [23, Prop. 3.1.7] (see also [?, Sections
1.2 and 3.6]) to establish existence of an optimal policy. ■
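The notion of an optimal policy $\phi^*(x_t, t)$ can be illustrated with backward induction (dynamic programming) on a small finite-horizon problem with a discrete state. The sketch below is not the paper's numerical method: the battery dynamics $x_{t+1} = x_t - u_t$, the integer grid, the square-root reward and the truncation to a horizon $T$ are all assumptions made to keep the example tiny.

```python
import math

def solve_mocp(T, x_max, beta, reward):
    """Backward induction for x_{t+1} = x_t - u_t with integer 0 <= u_t <= x_t.

    Returns the value-to-go table V[t][x] and the policy table policy[t][x].
    """
    V = [[0.0] * (x_max + 1) for _ in range(T + 1)]  # V[T][x] = 0 at the end
    policy = [[0] * (x_max + 1) for _ in range(T)]
    for t in reversed(range(T)):
        for x in range(x_max + 1):
            best_u, best_val = 0, -math.inf
            for u in range(x + 1):  # the constraint u <= x keeps the state feasible
                val = reward(x, u, t) + beta * V[t + 1][x - u]
                if val > best_val:
                    best_u, best_val = u, val
            V[t][x], policy[t][x] = best_val, best_u
    return V, policy

def rollout(policy, x0):
    """Apply the policy along the state trajectory that starts at x0."""
    x, actions = x0, []
    for t in range(len(policy)):
        u = policy[t][x]
        actions.append(u)
        x -= u
    return actions

V, policy = solve_mocp(T=2, x_max=2, beta=0.9,
                       reward=lambda x, u, t: math.sqrt(u))
actions = rollout(policy, 2)
```

With a concave reward the policy spreads the two units of charge over both stages instead of spending everything at once, which is the behavior one expects from a discounted concave problem with a mild discount.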
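The main theoretical result of this work is that we can find
an NE of a DPG by solving the MOCP (22). This is proved
in the following theorem.

Theorem 1. If problem (1) is a DPG, under Assumptions
1-4, the solution of the MOCP (22) is an NE of (1) when the
objective function of the MOCP is given by

$$
\Pi(x_t, u_t, t) = \int_{0}^{1} \left( \sum_{i=1}^{Q} \sum_{m \in \mathcal{X}(i)} \frac{\partial \pi^i(\eta(\lambda), \xi(\lambda), t)}{\partial x_t^m} \frac{d \eta^m(\lambda)}{d \lambda}
+ \sum_{i=1}^{Q} \frac{\partial \pi^i(\eta(\lambda), \xi(\lambda), t)}{\partial u_t^i} \frac{d \xi^i(\lambda)}{d \lambda} \right) d\lambda
\tag{23}
$$

where $\eta(\lambda) = (\eta^k(\lambda))_{k=1}^{S}$, $\xi(\lambda) = (\xi^i(\lambda))_{i=1}^{Q}$, and $\eta(0)$-$\xi(0)$
and $\eta(1)$-$\xi(1)$ correspond to the initial and final state-action
conditions, respectively.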

The usefulness of Theorem 1 is that, in order to find an NE
of (1), instead of solving several coupled control problems, we
can check whether (1) is a DPG (i.e., whether any one of Lemmas 2-4
holds). If so, we can find an NE by computing the potential
function (23) and, then, by solving the equivalent MOCP (22).

Proof: The proof is structured in five steps. First, we
compute the Euler equation of the Lagrangian of the dynamic
game and derive the KKT optimality conditions. Assumption 3
is required to ensure that the KKT conditions hold at the
optimal point and that there exist feasible dual variables [20,
Prop. 3.3.8]. Second, we study when the necessary optimality
conditions of the game become equal to those of the MOCP.
Third, we show that having the same necessary optimality
conditions is a sufficient condition for the dynamic game to be
potential. Fourth, having established that the dynamic game
is a DPG, we show that the solution to the MOCP (whose
existence is guaranteed by Assumption 4) is also an NE of
the DPG. Finally, we derive the per-stage utility of the MOCP
as the potential function of a suitable vector field. We proceed
to explain the details.

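First, for problem (1), introduce each player's Lagrangian
$\forall i \in \mathcal{Q}$:

$$
\mathcal{L}^i(x_t, u_t, \lambda_t^i, \mu_t^i) = \sum_{t=0}^{\infty} \beta^t \left( \pi^i(x_t^i, u_t, t)
+ \lambda_t^{i\top} \left( f(x_t, u_t, t) - x_{t+1} \right) + \mu_t^{i\top} g(x_t, u_t, t) \right)
= \sum_{t=0}^{\infty} \beta^t \Phi^i(x_t, u_t, t, \lambda_t^i, \mu_t^i)
\tag{24}
$$

where $\lambda_t^i = (\lambda_t^{ik})_{k=1}^{S}$ and $\mu_t^i = (\mu_t^{ic})_{c=1}^{C}$ are the corresponding
vectors of multipliers, and we introduced the shorthand:

$$
\Phi^i(x_t, u_t, t, \lambda_t^i, \mu_t^i) = \pi^i(x_t^i, u_t, t)
+ \lambda_t^{i\top} \left( f(x_t, u_t, t) - x_{t+1} \right) + \mu_t^{i\top} g(x_t, u_t, t)
\tag{25}
$$

The discrete-time Euler-Lagrange equations [15, Sec. 6.1]
applied to each player's Lagrangian are given by:

$$
\frac{\partial \Phi^i(x_{t-1}, u_{t-1}, t-1, \lambda_{t-1}^i, \mu_{t-1}^i)}{\partial x_t^m}
+ \beta \frac{\partial \Phi^i(x_t, u_t, t, \lambda_t^i, \mu_t^i)}{\partial x_t^m} = 0, \quad \forall m \in \mathcal{X}(i)
\tag{26}
$$

$$
\frac{\partial \Phi^i(x_{t-1}, u_{t-1}, t-1, \lambda_{t-1}^i, \mu_{t-1}^i)}{\partial u_t^i}
+ \beta \frac{\partial \Phi^i(x_t, u_t, t, \lambda_t^i, \mu_t^i)}{\partial u_t^i} = 0
\tag{27}
$$

where the factor $\beta$ comes from the discount $\beta^t$ in (24) after dividing by $\beta^{t-1}$.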


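Actually, note that (26)-(27) are the Euler-Lagrange equations
in a more general form than the standard reduced form. As
mentioned in Sec. II (see also, e.g., [15, Sec. 6.1], [?]), in the
standard reduced form, the current action can be posed as a
function of the current and future states: $u_t = \varphi(x_t, x_{t+1}, t)$,
for some function $\varphi : \mathcal{X} \times \mathcal{X} \times \mathbb{N} \to \mathcal{U}$. The reason why we
introduced this general form of the Euler-Lagrange equations
is that such a function $\varphi$ may not exist for an arbitrary
state-transition function $f$. By substituting (25) into (26)-(27), and
adding the corresponding constraints, we obtain the KKT
conditions of the game for every player $i \in \mathcal{Q}$, the
state-components $m \in \mathcal{X}(i)$, and all extra constraints:

$$
\frac{\partial \pi^i(x_t^i, u_t, t)}{\partial x_t^m}
+ \sum_{k=1}^{S} \lambda_t^{ik} \frac{\partial f^k(x_t, u_t, t)}{\partial x_t^m}
+ \sum_{c=1}^{C} \mu_t^{ic} \frac{\partial g^c(x_t, u_t, t)}{\partial x_t^m}
- \beta^{-1} \lambda_{t-1}^{im} = 0
\tag{28}
$$

$$
\frac{\partial \pi^i(x_t^i, u_t, t)}{\partial u_t^i}
+ \sum_{k=1}^{S} \lambda_t^{ik} \frac{\partial f^k(x_t, u_t, t)}{\partial u_t^i}
+ \sum_{c=1}^{C} \mu_t^{ic} \frac{\partial g^c(x_t, u_t, t)}{\partial u_t^i} = 0
\tag{29}
$$

$$
x_{t+1} = f(x_t, u_t, t), \quad g(x_t, u_t, t) \le 0
\tag{30}
$$

$$
\mu_t^i \le 0, \quad \mu_t^{i\top} g(x_t, u_t, t) = 0
\tag{31}
$$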

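Second, we find the KKT conditions of the MOCP. To do
so, we obtain the Lagrangian of (22):

$$
\mathcal{L}^{\Pi}(x_t, u_t, \gamma_t, \delta_t) = \sum_{t=0}^{\infty} \beta^t \left( \Pi(x_t, u_t, t)
+ \gamma_t^{\top} \left( f(x_t, u_t, t) - x_{t+1} \right) + \delta_t^{\top} g(x_t, u_t, t) \right)
\tag{32}
$$

where $\gamma_t = (\gamma_t^k)_{k=1}^{S}$ and $\delta_t = (\delta_t^c)_{c=1}^{C}$ are the corresponding
multipliers. Again, from (32) we derive the Euler-Lagrange
equations, which, together with the corresponding constraints,
yield the KKT system of optimality conditions for all
state-components, $m = 1, \ldots, S$, and all actions, $i = 1, \ldots, Q$:

$$
\frac{\partial \Pi(x_t, u_t, t)}{\partial x_t^m}
+ \sum_{k=1}^{S} \gamma_t^{k} \frac{\partial f^k(x_t, u_t, t)}{\partial x_t^m}
+ \sum_{c=1}^{C} \delta_t^{c} \frac{\partial g^c(x_t, u_t, t)}{\partial x_t^m}
- \beta^{-1} \gamma_{t-1}^{m} = 0
\tag{33}
$$

$$
\frac{\partial \Pi(x_t, u_t, t)}{\partial u_t^i}
+ \sum_{k=1}^{S} \gamma_t^{k} \frac{\partial f^k(x_t, u_t, t)}{\partial u_t^i}
+ \sum_{c=1}^{C} \delta_t^{c} \frac{\partial g^c(x_t, u_t, t)}{\partial u_t^i} = 0
\tag{34}
$$

$$
x_{t+1} = f(x_t, u_t, t), \quad g(x_t, u_t, t) \le 0
\tag{35}
$$

$$
\delta_t \le 0, \quad \delta_t^{\top} g(x_t, u_t, t) = 0
\tag{36}
$$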

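In order for the MOCP (22) to have the same optimality
conditions as the game (1), by comparing (28)-(31) with
(33)-(36), we conclude that the following conditions must be
satisfied $\forall i \in \mathcal{Q}$:

$$
\frac{\partial \pi^i(x_t^i, u_t, t)}{\partial x_t^m} = \frac{\partial \Pi(x_t, u_t, t)}{\partial x_t^m}, \quad \forall m \in \mathcal{X}(i)
\tag{37}
$$

$$
\frac{\partial \pi^i(x_t^i, u_t, t)}{\partial u_t^i} = \frac{\partial \Pi(x_t, u_t, t)}{\partial u_t^i}
\tag{38}
$$

$$
\lambda_t^i = \gamma_t, \quad \mu_t^i = \delta_t
\tag{39}
$$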

Third, when conditions (37)-(38) are satisfied, Lemma 2
states that problem (1) is a DPG.

Fourth, note that condition (39) represents a feasible point
of the game. The reason is that if there exists an optimal
primal variable, then the existence of dual variables in the
MOCP is guaranteed by suitable regularity conditions. Since
the existence of optimal primal variables of the MOCP is
ensured by Assumption 4, the regularity conditions established
by Assumption 3 guarantee that there exist some $\gamma_t$ and $\delta_t$
that satisfy the KKT conditions of the MOCP. Substituting
these dual variables of the MOCP in place of the individual
$\lambda_t^i$ and $\mu_t^i$ in (28)-(31) for every $i \in \mathcal{Q}$ results in a system
of equations where the only unknowns are the user strategies.
This system has exactly the same structure as the one already
presented for the MOCP in the primal variables. Therefore,
the MOCP primal solution also satisfies the KKT conditions
of the DPG. Indeed, it is straightforward to see that an optimal
solution of the MOCP is also an NE of the game. Let $\{u_t^*\}_{t=0}^{\infty}$
denote the MOCP solution, so that it satisfies the following
inequality $\forall u_t^i \in \mathcal{U}^i$:

$$
\sum_{t=0}^{\infty} \beta^t \Pi(x_t, u_t^{i*}, u_t^{-i*}, t) \ge \sum_{t=0}^{\infty} \beta^t \Pi(x_t, u_t^{i}, u_t^{-i*}, t)
\tag{40}
$$

From Definition 3, we conclude that the MOCP optimal
solution is also an NE of game (1). The opposite may not be
true in general. Indeed, this solution, in which dual variables
are shared between players, is only a subclass of the possible
NE of the game. Nevertheless, other NE that do not share this
property have been referred to as unstable by [?] for static
games.

Fifth, although we have shown that we can find an NE of the
DPG by solving a MOCP, we still need to find the objective
of the MOCP. In order to find $\Pi$, we deduce from (37), (38),
and (21) that the vector field (20) can be expressed as

$$
F = \nabla \Pi(x_t, u_t, t)
\tag{41}
$$

Lemma 1 establishes that $F$ is conservative. Thus, the objective
of the MOCP is the potential of the field, which can be
computed through the line integral (23). ■
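The statement of Theorem 1 can be checked by brute force on a very small game. In the sketch below, two players share an integer battery over a three-step horizon, their utilities have the separable structure of Lemma 3, and the maximizer of the discounted potential is tested against every feasible unilateral deviation. The game, the horizon and all function names are hypothetical and chosen only so that exhaustive enumeration is feasible.

```python
import itertools
import math

BETA, T, X0 = 0.9, 3, 3
ACTIONS = (0, 1)  # each player either idles or draws one unit of charge

def potential(x, u, t):
    # Hypothetical common term: reward for total draw plus value of stored charge.
    return math.log(1 + u[0] + u[1]) + 0.1 * x

def utility(i, x, u, t):
    # Separable structure: common term plus a term that ignores player i's action.
    return potential(x, u, t) + 0.5 * u[1 - i]

def rollout(seq):
    """Return the visited states if seq respects u0 + u1 <= x_t, else None."""
    x, xs = X0, []
    for u in seq:
        if u[0] + u[1] > x:
            return None
        xs.append(x)
        x -= u[0] + u[1]
    return xs

def total(stage, seq):
    """Discounted sum of a stage function along a feasible joint sequence."""
    xs = rollout(seq)
    if xs is None:
        return None
    return sum(BETA**t * stage(xs[t], u, t) for t, u in enumerate(seq))

joint = list(itertools.product(ACTIONS, repeat=2))
feasible = [s for s in itertools.product(joint, repeat=T) if rollout(s) is not None]
best = max(feasible, key=lambda s: total(potential, s))  # solution of the toy MOCP

def is_nash(seq, tol=1e-12):
    for i in (0, 1):
        stage = lambda x, u, t, i=i: utility(i, x, u, t)
        base = total(stage, seq)
        for own in itertools.product(ACTIONS, repeat=T):
            dev = tuple((own[t], u[1]) if i == 0 else (u[0], own[t])
                        for t, u in enumerate(seq))
            val = total(stage, dev)
            if val is not None and val > base + tol:
                return False
    return True

nash = is_nash(best)
```

The check succeeds because, for a fixed sequence of the other player's actions, each player's discounted utility differs from the discounted potential by a quantity that the player cannot influence, so a maximizer of the potential leaves no profitable unilateral deviation.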

In the next sections, we show how to apply this 
methodology—of solving DPG through an equivalent 
MOCP—to different practical problems. 

V. Energy Demand in the Smart Grid
as a Linear Quadratic Dynamic Game

Our first example consists of a linear-quadratic dynamic
game (LQDG) that solves a smart grid resource allocation
problem. LQDG are convenient because they are amenable to
analytical and closed-form solutions [?, Ch. 6]. Our analysis
is novel though. To the best of our knowledge, LQDG have
not been studied under the easier DPG framework before.

A. Energy demand control DPG and equivalent MOCP 

Consider a community of $Q$ users (i.e., players) that use the
smart grid resources in different activities (like communications,
heating, lighting, home appliances or production needs).
Suppose that the electrical grid has $S$ types of energy resources
(such as rechargeable batteries, coal, fuel, hydroelectric power
or biomass). The state of the game $x_t \in \mathbb{R}^S$ is the total
amount of overall resources in the smart grid at time $t$. All
players share all components of the state vector (i.e., $\mathcal{X}^i = \mathcal{X}$
and $\mathcal{X}(i) = \{1, \ldots, S\}$, $\forall i \in \mathcal{Q}$). The amount of resources
consumed or contributed by player $i$ at time $t$ is denoted by the
action vector $u_t^i \in \mathbb{R}^{A^i}$, where $A^i$ is the number of activities.

The expenditure and contribution of each player $i$ is
weighted by matrix $B^i \in \mathbb{R}^{S \times A^i}$. Also, resources can be
autonomously recharged/depleted, which is modeled by a
shared matrix $C \in \mathbb{R}^{S \times S}$. Thus, the state transition of the
system is $x_{t+1} = C x_t + \sum_{i=1}^{Q} B^i u_t^i$.

We consider two cost terms: unsatished demand and un¬ 
balanced resources. Given the available resources xj, every 
player z will have a target demand D*Xt that it wants to 
satisfy, for some demand matrix D* G 5ftThe disutility 
from an unsatished demand is modelled by the quadratic 
form (D®Xt — uj) Q® (D®xt — uj), with demand cost ma¬ 
trix Q® G In addition, the available resources should 

be just enough to satisfy the demand. There is a cost for 
having too little (e.g., productivity decrease) or too much (e.g., 
storage costs) resources. This cost can be modeled as another 
quadratic form: (xj — X(_i)^ R (xt — xt_i), with unbalanced 
resources cost matrix R G 5ft®^®. In order to pose the game as 
a maximization problem, we assume {Q^IvigC to be negative 
dehnite matrices, and R a negative semidehnite matrix (this 
is represented by Q® ^ 0 and R AO). 


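The dynamic energy demand control game is given by the following coupled optimal control problems:

$$
\mathcal{G}_3:\ \forall i \in \mathcal{Q} \quad
\begin{aligned}
\underset{\{\mathbf{u}_t^i\} \in \prod_{t=0}^{\infty} \mathcal{U}^i}{\text{maximize}} \quad & \sum_{t=0}^{\infty} \beta^t \Big( (\mathbf{x}_t - \mathbf{x}_{t-1})^\top \mathbf{R} (\mathbf{x}_t - \mathbf{x}_{t-1}) + (\mathbf{D}^i \mathbf{x}_t - \mathbf{u}_t^i)^\top \mathbf{Q}^i (\mathbf{D}^i \mathbf{x}_t - \mathbf{u}_t^i) \Big) \\
\text{s.t.} \quad & \mathbf{x}_{t+1} = \mathbf{C}\mathbf{x}_t + \sum_{i=1}^{Q} \mathbf{B}^i \mathbf{u}_t^i, \quad \mathbf{x}_0 \text{ given}
\end{aligned}
\tag{42}
$$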


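By defining augmented state and action vectors:

$$\tilde{\mathbf{x}}_t \triangleq \left[\mathbf{x}_t^\top, \mathbf{x}_{t-1}^\top\right]^\top, \qquad \tilde{\mathbf{u}}_t^i \triangleq \mathbf{D}^i \mathbf{x}_t - \mathbf{u}_t^i \tag{43}$$

we can rewrite (42) in the standard linear-quadratic form:

$$
\begin{aligned}
\underset{\{\tilde{\mathbf{u}}_t^i\} \in \prod_{t=0}^{\infty} \tilde{\mathcal{U}}^i}{\text{maximize}} \quad & \sum_{t=0}^{\infty} \beta^t \left( \tilde{\mathbf{x}}_t^\top \tilde{\mathbf{R}} \tilde{\mathbf{x}}_t + \tilde{\mathbf{u}}_t^{i\top} \mathbf{Q}^i \tilde{\mathbf{u}}_t^i \right) \\
\text{s.t.} \quad & \tilde{\mathbf{x}}_{t+1} = \tilde{\mathbf{A}} \tilde{\mathbf{x}}_t - \sum_{i=1}^{Q} \tilde{\mathbf{B}}^i \tilde{\mathbf{u}}_t^i, \quad \tilde{\mathbf{x}}_0 \text{ given}
\end{aligned}
\tag{44}
$$

where

$$\tilde{\mathbf{A}} \triangleq \begin{bmatrix} \mathbf{C} + \sum_{i \in \mathcal{Q}} \mathbf{B}^i \mathbf{D}^i & \mathbf{0}_{S \times S} \\ \mathbf{I}_S & \mathbf{0}_{S \times S} \end{bmatrix}, \qquad \tilde{\mathbf{B}}^i = \begin{bmatrix} \mathbf{B}^i \\ \mathbf{0}_{S \times A^i} \end{bmatrix} \tag{45}$$

$$\tilde{\mathbf{R}} \triangleq \begin{bmatrix} \mathbf{R} & -\mathbf{R} \\ -\mathbf{R} & \mathbf{R} \end{bmatrix} \tag{46}$$

and where $\mathbf{I}_S$ and $\mathbf{0}_{S \times S}$ denote the identity and null matrices of size $S \times S$, respectively. LQDG in the form (44) have been presented in [24, Ch. 6], where an NE is found by i) solving the system of coupled finite horizon OCP, ii) finding the limit of this solution as the horizon tends to infinity, and then iii) verifying that this limiting solution provides an NE solution for the infinite-horizon game. Here we follow a different and simpler approach. First, we show that problem (44) can be expressed in separable form: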

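$$
\begin{aligned}
\pi^i(\tilde{\mathbf{x}}_t, \tilde{\mathbf{u}}_t) &= \tilde{\mathbf{x}}_t^\top \tilde{\mathbf{R}} \tilde{\mathbf{x}}_t + \tilde{\mathbf{u}}_t^{i\top} \mathbf{Q}^i \tilde{\mathbf{u}}_t^i \\
&= \tilde{\mathbf{x}}_t^\top \tilde{\mathbf{R}} \tilde{\mathbf{x}}_t + \sum_{p \in \mathcal{Q}} \tilde{\mathbf{u}}_t^{p\top} \mathbf{Q}^p \tilde{\mathbf{u}}_t^p - \sum_{j \in \mathcal{Q}: j \ne i} \tilde{\mathbf{u}}_t^{j\top} \mathbf{Q}^j \tilde{\mathbf{u}}_t^j
\end{aligned}
\tag{47}
$$

We identify the potential and separable functions in (47):

$$\Pi(\tilde{\mathbf{x}}_t, \tilde{\mathbf{u}}_t, t) = \tilde{\mathbf{x}}_t^\top \tilde{\mathbf{R}} \tilde{\mathbf{x}}_t + \sum_{p \in \mathcal{Q}} \tilde{\mathbf{u}}_t^{p\top} \mathbf{Q}^p \tilde{\mathbf{u}}_t^p \tag{48}$$

$$\Theta^i(\tilde{\mathbf{u}}_t^{-i}, t) = - \sum_{j \in \mathcal{Q}: j \ne i} \tilde{\mathbf{u}}_t^{j\top} \mathbf{Q}^j \tilde{\mathbf{u}}_t^j \tag{49}$$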

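From Lemma we conclude that problem (42) is a DPG. Note also that Assumptions 1-2 hold. Moreover, the objective in (44) is concave and the state dynamics, which are the only equality constraint, are linear. Therefore, Slater's constraint qualification is satisfied and Assumption 3 holds. In addition, the matrices $\mathbf{Q}^i \prec 0$, $\forall i \in \mathcal{Q}$, and $\mathbf{R} \preceq 0$ make the potential (48) coercive. Hence, Lemma states that Assumption 4 is satisfied. Since Assumptions 1-4 hold, Theorem 1 establishes that we can find an NE of (42) by solving an equivalent MOCP:

$$
\mathcal{P}_3: \quad
\begin{aligned}
\underset{\{\tilde{\mathbf{u}}_t\} \in \prod_{t=0}^{\infty} \tilde{\mathcal{U}}}{\text{maximize}} \quad & V(\tilde{\mathbf{x}}_0) = \sum_{t=0}^{\infty} \beta^t \Big( \tilde{\mathbf{x}}_t^\top \tilde{\mathbf{R}} \tilde{\mathbf{x}}_t + \sum_{p \in \mathcal{Q}} \tilde{\mathbf{u}}_t^{p\top} \mathbf{Q}^p \tilde{\mathbf{u}}_t^p \Big) \\
\text{s.t.} \quad & \tilde{\mathbf{x}}_{t+1} = \tilde{\mathbf{A}} \tilde{\mathbf{x}}_t - \sum_{i=1}^{Q} \tilde{\mathbf{B}}^i \tilde{\mathbf{u}}_t^i, \quad \tilde{\mathbf{x}}_0 \text{ given}
\end{aligned}
\tag{50}
$$

where the cumulative objective function $V$ is known as the value function in the optimal control literature (see, e.g., [23]). Let $\tilde{\mathbf{u}}_t = (\tilde{\mathbf{u}}_t^i)_{i \in \mathcal{Q}}$ be the vector of all players' augmented actions. Aggregate all players' demand cost matrices in a block diagonal matrix $\mathbf{Q} = \operatorname{diag}(\mathbf{Q}^1, \ldots, \mathbf{Q}^Q)$ of size $\sum_{i \in \mathcal{Q}} A^i \times \sum_{i \in \mathcal{Q}} A^i$, and aggregate all players' expenditure weighting matrices in a $2S \times \sum_{i \in \mathcal{Q}} A^i$ thick matrix $\tilde{\mathbf{B}} = [\tilde{\mathbf{B}}^1, \ldots, \tilde{\mathbf{B}}^Q]$. Then, we can rewrite the value and transition functions as follows:

$$V(\tilde{\mathbf{x}}_0) = \sum_{t=0}^{\infty} \beta^t \left( \tilde{\mathbf{u}}_t^\top \mathbf{Q} \tilde{\mathbf{u}}_t + \tilde{\mathbf{x}}_t^\top \tilde{\mathbf{R}} \tilde{\mathbf{x}}_t \right) \tag{51}$$

$$\tilde{\mathbf{x}}_{t+1} = \tilde{\mathbf{A}} \tilde{\mathbf{x}}_t - \tilde{\mathbf{B}} \tilde{\mathbf{u}}_t \tag{52}$$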


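B. Analytical solution to the MOCP and simulation results

It is well known that the value function satisfies a recursive relationship, known as the Bellman equation (see, e.g., [?]):

$$V(\tilde{\mathbf{x}}_t) = \beta^t \left( \tilde{\mathbf{u}}_t^\top \mathbf{Q} \tilde{\mathbf{u}}_t + \tilde{\mathbf{x}}_t^\top \tilde{\mathbf{R}} \tilde{\mathbf{x}}_t \right) + \beta^{t+1} V(\tilde{\mathbf{x}}_{t+1}) \tag{53}$$

Moreover, for an LQ control problem, it is known [24, Ch. 6] that the optimal value function can be expressed as a quadratic form of the state:

$$V(\tilde{\mathbf{x}}_t) = \tilde{\mathbf{x}}_t^\top \mathbf{P} \tilde{\mathbf{x}}_t \tag{54}$$

for some negative semidefinite matrix $\mathbf{P}$. We can use (54) to find a closed form expression for the sequence of optimal actions as follows. Expand (52) and (54) into (53):

$$V(\tilde{\mathbf{x}}_t) = \beta^t \left( \tilde{\mathbf{u}}_t^\top \mathbf{Q} \tilde{\mathbf{u}}_t + \tilde{\mathbf{x}}_t^\top \tilde{\mathbf{R}} \tilde{\mathbf{x}}_t \right) + \beta^{t+1} (\tilde{\mathbf{A}} \tilde{\mathbf{x}}_t - \tilde{\mathbf{B}} \tilde{\mathbf{u}}_t)^\top \mathbf{P} (\tilde{\mathbf{A}} \tilde{\mathbf{x}}_t - \tilde{\mathbf{B}} \tilde{\mathbf{u}}_t) \tag{55}$$

Now, we just have to maximize (55) over $\tilde{\mathbf{u}}_t$. Since $\mathbf{Q}$ and $\mathbf{P}$ are negative definite and semidefinite matrices, respectively, a necessary and sufficient condition for the maximum is

$$\nabla_{\tilde{\mathbf{u}}_t} V(\tilde{\mathbf{x}}_t) = \beta^t \mathbf{Q} \tilde{\mathbf{u}}_t - \beta^{t+1} \tilde{\mathbf{B}}^\top \mathbf{P} (\tilde{\mathbf{A}} \tilde{\mathbf{x}}_t - \tilde{\mathbf{B}} \tilde{\mathbf{u}}_t) = 0 \tag{56}$$

From (56), we obtain an analytical expression for the optimal action at any time step:

$$\tilde{\mathbf{u}}_t = \beta \left( \mathbf{Q} + \beta \tilde{\mathbf{B}}^\top \mathbf{P} \tilde{\mathbf{B}} \right)^{-1} \tilde{\mathbf{B}}^\top \mathbf{P} \tilde{\mathbf{A}} \tilde{\mathbf{x}}_t \tag{57}$$

If we are also interested in finding the optimal value, we can expand (57) into (55) and isolate $\mathbf{P}$:

$$\mathbf{P} = \tilde{\mathbf{R}} + \beta \tilde{\mathbf{A}}^\top \mathbf{P} \tilde{\mathbf{A}} - \beta^2 \tilde{\mathbf{A}}^\top \mathbf{P} \tilde{\mathbf{B}} \left( \mathbf{Q} + \beta \tilde{\mathbf{B}}^\top \mathbf{P} \tilde{\mathbf{B}} \right)^{-1} \tilde{\mathbf{B}}^\top \mathbf{P} \tilde{\mathbf{A}} \tag{58}$$

Note that (58) is a discrete algebraic Riccati equation, which is known to be a contraction mapping if $\mathbf{Q} \prec 0$, $\tilde{\mathbf{R}} \preceq 0$ and the spectral radius of $\tilde{\mathbf{A}}$ is smaller than one [25, Ch. 5] (the analysis can be performed under weaker conditions though [?], [26]). When (58) is a contraction, it has a unique solution $\mathbf{P}^\star$ that can be approximated by iterating the following fixed point equation, such that $\lim_{n \to \infty} \mathbf{P}_n = \mathbf{P}^\star$:

$$\mathbf{P}_{n+1} = \tilde{\mathbf{R}} + \beta \tilde{\mathbf{A}}^\top \mathbf{P}_n \tilde{\mathbf{A}} - \beta^2 \tilde{\mathbf{A}}^\top \mathbf{P}_n \tilde{\mathbf{B}} \left( \mathbf{Q} + \beta \tilde{\mathbf{B}}^\top \mathbf{P}_n \tilde{\mathbf{B}} \right)^{-1} \tilde{\mathbf{B}}^\top \mathbf{P}_n \tilde{\mathbf{A}} \tag{59}$$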
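The fixed-point iteration (59) and the feedback law (57) can be illustrated in the scalar case, where every matrix reduces to a number. The following is only a minimal sketch under that simplifying assumption: the function names and the numerical values of `a`, `b`, `q`, `r` are hypothetical and are not taken from the simulation of this section.

```python
def riccati_step(p, a, b, q, r, beta):
    """One application of the fixed-point map (59) for scalar a, b, q, r."""
    return r + beta * a * a * p - (beta * a * p * b) ** 2 / (q + beta * b * b * p)


def solve_riccati(a, b, q, r, beta, tol=1e-12, max_iter=10_000):
    """Iterate (59) from P_0 = 0 until successive iterates differ by less than tol."""
    p = 0.0
    for _ in range(max_iter):
        p_next = riccati_step(p, a, b, q, r, beta)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    raise RuntimeError("Riccati iteration did not converge")


def feedback_gain(p, a, b, q, beta):
    """Gain k such that the optimal action in (57) is u_t = k * x_t."""
    return beta * b * p * a / (q + beta * b * b * p)


# Illustrative values (assumptions): q < 0, r <= 0 and |a| < 1.
a, b, q, r, beta = 0.5, 1.0, -1.0, -1.0, 0.9
p_star = solve_riccati(a, b, q, r, beta)
k_star = feedback_gain(p_star, a, b, q, beta)
closed_loop = a - b * k_star  # x_{t+1} = (a - b * k) * x_t under the optimal action
```

In the matrix case the same loop applies with matrix products and a linear solve in place of the scalar division.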


We have simulated the smart grid model for $Q = 8$ players, $S = 4$ resources, $A^i = 6$ activities for every player, random negative definite matrices $\mathbf{Q}^i$, $\forall i \in \mathcal{Q}$, and a random negative semidefinite matrix $\mathbf{R}$ (to build these negative matrices we build an intermediate matrix, e.g., $\mathbf{R}_{\text{int}}$, by drawing random numbers from a uniform distribution, with support $[0, 10]$ for $\mathbf{Q}^i$ and $[0, 5]$ for $\mathbf{R}$, and compute $\mathbf{R} = -\mathbf{R}_{\text{int}}^\top \mathbf{R}_{\text{int}}$). Matrices $\mathbf{C}$, $\mathbf{B}^i$ and $\mathbf{D}^i$ are also random, with elements drawn from the spherical normal distribution. Finally, the initial state was set to a vector of ones, and the discount factor to $\beta = 0.9$.
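The construction of the negative matrices described above can be sketched as follows. This is an illustrative sketch only: the helper names and the random seed are assumptions, and the product $-\mathbf{M}^\top \mathbf{M}$ is negative semidefinite for any $\mathbf{M}$ (it is negative definite when $\mathbf{M}$ has full column rank, which holds almost surely for a random square matrix).

```python
import random


def random_matrix(rows, cols, high, rng):
    """Matrix with entries drawn uniformly from [0, high]."""
    return [[rng.uniform(0.0, high) for _ in range(cols)] for _ in range(rows)]


def neg_gram(m):
    """Return -(m^T m), which is negative semidefinite by construction."""
    rows, cols = len(m), len(m[0])
    return [[-sum(m[k][i] * m[k][j] for k in range(rows)) for j in range(cols)]
            for i in range(cols)]


def quad_form(mat, x):
    """Evaluate the quadratic form x^T mat x."""
    n = len(x)
    return sum(x[i] * mat[i][j] * x[j] for i in range(n) for j in range(n))


rng = random.Random(0)   # fixed seed, an arbitrary choice for reproducibility
S = 4                    # number of resources, as in the simulation
R_int = random_matrix(S, S, 5.0, rng)
R = neg_gram(R_int)      # plays the role of R = -R_int^T R_int
```

Since $\mathbf{x}^\top \mathbf{R} \mathbf{x} = -\|\mathbf{R}_{\text{int}} \mathbf{x}\|^2$, `quad_form(R, x)` is nonpositive for every vector `x`, up to rounding.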

Figure 1 (top) shows the instant utilities per player over time. Recall that the utilities have been defined as negative costs. Therefore, each player's utility starts at a negative value and converges to zero with time. This behaviour illustrates that all players attain an NE in which they are able to satisfy their demand as well as to hold just enough available resources. Figure 1 (bottom) shows the evolution of the part of the cost corresponding to the individual coefficients $\tilde{\mathbf{u}}_t^i = \mathbf{D}^i \mathbf{x}_t - \mathbf{u}_t^i$. These coefficients represent the mismatch between the target demand, $\mathbf{D}^i \mathbf{x}_t$, and the actual player activities $\mathbf{u}_t^i$. We can see that the agents adjust their actions $\mathbf{u}_t^i$ to satisfy the target demand. The equilibrium between target demand and players' activities is an expected consequence of the stability of the LQ game in infinite horizon [24, Ch. 6].



Fig. 1. Dynamic smart grid scenario with $Q = 8$ players. (Top) Instant utility values of players. (Bottom) Players' decision coefficients evolution in time. Horizontal axis: time.


VI. Network Flow Control: Infinite Horizon
Approximated by a Finite Horizon Dynamic Game


Several works (see, e.g., [27]-[30]) have considered network flow control as an optimization problem wherein each source is characterized by a utility function that depends on the transmission rate, and the goal is to maximize the aggregated utility. We generalize the standard model by considering that the nodes are equipped with batteries that are depleted proportionally to the outgoing flow. In addition, we consider several layers of relay nodes, each one with multiple links, so there are several paths between source and destination. When the batteries are completely depleted, no more transmissions are allowed and the game is over. Hence, although we formulate the problem as an infinite horizon dynamic game, the effective time horizon (before the batteries deplete) is finite. This problem has no known analytical solution, but the utilities are concave. Therefore, the finite horizon approximation is convenient because we can solve an equivalent concave optimization problem, significantly reducing the computational load with respect to other optimal control algorithms (e.g., dynamic programming).


A. Network flow control dynamic game and equivalent MOCP

Let $u_t^{ia}$ denote the flow along path $a$ for user $i$ at time $t$. Suppose there are $A^i$ possible paths for each player $i \in \mathcal{Q}$, so that $\mathbf{u}_t^i = (u_t^{ia})_{a=1}^{A^i}$ denotes the $i$-th player's action vector. Let $A = \sum_{i \in \mathcal{Q}} A^i$ denote the total number of available paths. Suppose there are $S$ relay nodes. Let $x_t^k$ denote the battery level of relay node $k$. The state of the game is given by $\mathbf{x}_t = (x_t^k)_{k=1}^{S}$, such that all players share all components of the state-vector (i.e., $\mathcal{X}^i = \mathcal{X}$ and $\mathcal{X}(i) = \{1, \ldots, S\}$, $\forall i \in \mathcal{Q}$). The battery level evolves with the following state-transition equation for all components $k = 1, \ldots, S$:

$$x_{t+1}^k = x_t^k - \delta \sum_{i=1}^{Q} \sum_{u^{ia} \in \mathcal{F}_k} u_t^{ia}, \qquad x_0^k = B_{\max}^k \tag{60}$$

where $\mathcal{F}_k$ denotes the subset of flows through node $k$, $B_{\max}^k$ is a positive scalar that stands for the maximum battery level of node $k$, and $\delta$ is a proportional factor.

Similar to the standard static flow control problem, each player intends to maximize a concave function $\Gamma : \mathbb{R} \to \mathbb{R}$ of the sum of rates across all available paths. This function $\Gamma$ can take different forms depending on the scenario under study, like the square root [?] or a capacity form. In addition to the transmission rate, we include the relay nodes' battery level in each player's utility, weighted by some positive parameter $\alpha$. The combination of these two objectives can be understood as the player aiming to maximize its total transmission rate, while saving the batteries of the relays.

There is a capacity constraint on the maximum aggregated rate at every relay and destination node. Let $\mathbf{c}_{\max} \in \mathbb{R}^{L}$ denote the vector of maximum capacities, where $L$ is the number of relays plus destination nodes. Let $\mathbf{M} = [m_{la}]$ denote the $L \times A$ matrix that defines the aggregated flows for each relay and destination node, such that element $m_{la} = 1$ if flow $a$ is aggregated in node $l$, and $m_{la} = 0$ otherwise.
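The coupled capacity constraint built from the matrix $\mathbf{M}$ can be sketched as a simple feasibility check. This is a hedged illustration: the function names are hypothetical, and the small incidence matrix below is invented for the example (it is not the connectivity matrix of the experiment).

```python
def aggregated_flows(M, u):
    """Return M u: the total flow aggregated at each relay or destination node."""
    return [sum(m_la * u_a for m_la, u_a in zip(row, u)) for row in M]


def is_feasible(M, u, c_max, tol=1e-12):
    """Check the constraints M u <= c_max and u >= 0 of the flow game."""
    if any(u_a < 0 for u_a in u):
        return False
    loads = aggregated_flows(M, u)
    return all(load <= cap + tol for load, cap in zip(loads, c_max))


# Hypothetical incidence matrix with L = 2 nodes and A = 3 paths.
M = [[1, 1, 0],
     [0, 1, 1]]
c_max = [0.5, 0.15]

ok = is_feasible(M, [0.2, 0.1, 0.05], c_max)        # second node exactly at capacity
too_much = is_feasible(M, [0.2, 0.1, 0.10], c_max)  # second node carries 0.2 > 0.15
```

Because a flow may traverse several nodes, a single player's action affects several rows of $\mathbf{M}\mathbf{u}_t$, which is what couples the players' constraints.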


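The dynamic network flow control game is given by the following set of coupled OCP:

$$
\mathcal{G}_4:\ \forall i \in \mathcal{Q} \quad
\begin{aligned}
\underset{\{\mathbf{u}_t^i\} \in \prod_{t=0}^{\infty} \mathcal{U}^i}{\text{maximize}} \quad & \sum_{t=0}^{\infty} \beta^t \left( \Gamma\!\left( \sum_{a=1}^{A^i} u_t^{ia} \right) + \alpha \sum_{k=1}^{S} x_t^k \right) \\
\text{s.t.} \quad & x_{t+1}^k = x_t^k - \delta \sum_{i=1}^{Q} \sum_{u^{ia} \in \mathcal{F}_k} u_t^{ia} \\
& \mathbf{M} \mathbf{u}_t \le \mathbf{c}_{\max}, \quad u_t^{ia} \ge 0 \\
& k = 1, \ldots, S, \quad a = 1, \ldots, A^i
\end{aligned}
\tag{61}
$$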

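Note that each player's utility can be expressed in separable form:

$$
\Gamma\!\left( \sum_{a=1}^{A^i} u_t^{ia} \right) + \alpha \sum_{k=1}^{S} x_t^k
= \sum_{p \in \mathcal{Q}} \Gamma\!\left( \sum_{a=1}^{A^p} u_t^{pa} \right) + \alpha \sum_{k=1}^{S} x_t^k - \sum_{j \in \mathcal{Q}: j \ne i} \Gamma\!\left( \sum_{a=1}^{A^j} u_t^{ja} \right)
\tag{62}
$$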


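Therefore, Lemma establishes that problem (61) is a DPG, with potential function given by:

$$\Pi(\mathbf{x}_t, \mathbf{u}_t, t) = \sum_{i \in \mathcal{Q}} \Gamma\!\left( \sum_{a=1}^{A^i} u_t^{ia} \right) + \alpha \sum_{k=1}^{S} x_t^k \tag{63}$$

Before applying Theorem 1 we have to check whether Assumptions 1-4 are satisfied. We follow [?] and choose $\Gamma(\cdot) = \sqrt{\epsilon + \cdot}$ (where $\epsilon > 0$ is only added to avoid differentiability issues when $u_t^{ia} = 0$). Let $\mathcal{X}$ and $\mathcal{U}$ be open convex sets containing the Cartesian products of intervals $[0, B_{\max}]$ and $[0, \infty)$, respectively. It follows that Assumptions 1-2 hold. Moreover, since $\Gamma$ is concave and problem (61) has linear equality constraints and concave inequality constraints, Slater's condition holds, i.e., Assumption 3 is satisfied. Finally, since the constraint set in (61) is compact, Lemma states that Assumption 4 holds. Hence, Theorem 1 establishes that we can find an NE of (61) by solving the following MOCP:

$$
\mathcal{P}_4: \quad
\begin{aligned}
\underset{\{\mathbf{u}_t\} \in \prod_{t=0}^{\infty} \mathcal{U}}{\text{maximize}} \quad & \sum_{t=0}^{\infty} \beta^t \left( \sum_{i \in \mathcal{Q}} \Gamma\!\left( \sum_{a=1}^{A^i} u_t^{ia} \right) + \alpha \sum_{k=1}^{S} x_t^k \right) \\
\text{s.t.} \quad & x_{t+1}^k = x_t^k - \delta \sum_{i \in \mathcal{Q}} \sum_{u^{ia} \in \mathcal{F}_k} u_t^{ia} \\
& \mathbf{M} \mathbf{u}_t \le \mathbf{c}_{\max}, \quad \mathbf{u}_t \ge 0 \\
& k = 1, \ldots, S
\end{aligned}
\tag{64}
$$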

B. Finite horizon approximation and simulation results 

As opposed to the LQ smart-grid problem, there is no known closed form solution for problem (64). Thus, we have to rely on numerical methods to solve the MOCP. Suppose that we set the weight parameter $\alpha$ in $\Pi$ low enough to incentivize some positive transmission. Eventually, the nodes' batteries will be depleted, so the system will get stuck in an equilibrium state, with no further state transitions. Thus, we can approximate the infinite-horizon problem (64) as a finite-horizon problem, with horizon bounded by the time-step at which all batteries have been depleted. Moreover, in our setting, we have assumed $\Gamma$ to be concave. Therefore, we can effectively solve (64) with convex optimization solvers (we use the software described in [32]). The benefit of using a convex optimization solver is that standard optimal control algorithms are computationally demanding when the state and action spaces are subsets of vector spaces.
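To make the horizon bound concrete, the sketch below simulates the battery dynamics (60) and reports the step at which every battery is empty. It is an illustration under strong assumptions: the flow through each node is held constant (in the game the flows are decision variables that change over time), batteries are clipped at zero, and the function name and per-node flows are hypothetical.

```python
def depletion_horizon(b_max, delta, node_flows, max_steps=10_000, tol=1e-12):
    """Simulate x^k_{t+1} = x^k_t - delta * (flow through node k) with a
    constant flow per node; return (per-node depletion steps, overall horizon)."""
    levels = [b_max] * len(node_flows)
    depleted_at = [None] * len(node_flows)
    for t in range(1, max_steps + 1):
        for k, flow in enumerate(node_flows):
            if depleted_at[k] is not None:
                continue  # a depleted relay carries no more flow
            levels[k] -= delta * flow
            if levels[k] <= tol:
                levels[k] = 0.0
                depleted_at[k] = t
        if all(step is not None for step in depleted_at):
            return depleted_at, t
    raise RuntimeError("some battery never depletes within max_steps")


# B_max = 1 and delta = 0.05 are the values used in the experiment;
# the constant per-node flows are hypothetical.
steps, horizon = depletion_horizon(b_max=1.0, delta=0.05, node_flows=[0.5, 0.15])
```

With these numbers the first node loses 0.025 per step and the second 0.0075 per step, so the finite horizon is set by the slowest battery.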

For our numerical experiment, we consider $Q = 2$ players that share a network of $S = 4$ relay nodes, organized in two layers (see Figure 2). In this particular setting, each player is allowed to use four paths, $A^1 = A^2 = 4$. The connectivity matrix $\mathbf{M}$ can be obtained from Figure 2. The battery is initialized to $B_{\max} = 1$ for the four relay nodes, and we set the depleting factor $\delta = 0.05$, the discount factor $\beta = 0.9$, the weight $\alpha = 1$, $\epsilon = 0.001$ and the vector of maximum capacities $\mathbf{c}_{\max} = [0.5, 0.15, 0.5, 0.15, 0.4, 0.4]^\top$.





Fig. 2. Network scenario for two users and two levels of relaying nodes. Player $S_1$ aims to transmit to destination $D_1$, while $S_2$ aims to transmit to destination $D_2$. They have to share relay nodes $N_1, \ldots, N_4$. We denote the $L = 6$ aggregated flows as $L_1, \ldots, L_6$.


Figure 3 shows the evolution of the $L = 6$ aggregated flows, the $A = 8$ flows and the battery of each of the $S = 4$ relay nodes. Since we have included the battery level of the relay nodes in the users' utilities (i.e., $\alpha > 0$), the users have an extra incentive to limit their flow rate. Thus, there are two effective reasons to limit the flow rate: satisfying the problem constraints and saving battery. We can see that the aggregated flows with higher maximum capacity are not saturated ($L_1 < 0.5$, $L_3 < 0.5$, $L_5 < 0.4$, and $L_6 < 0.4$). The reason is that the users have limited their individual flow rates in order to save the relays' batteries. On the other hand, the aggregated flows with lower maximum capacity are saturated ($L_2 = L_4 = 0.15$) because the capacity constraint is more restrictive than the self-limitation incentive. When the batteries of the nodes with higher maximum capacity ($N_1$, $N_3$) are depleted (around $t = 70$), the flows through these nodes stop. This allows the other flows to transmit at a higher rate. At this time, the capacity constraint in $L_2$, $L_4$ is more restrictive than the self-limitation incentive for saving the batteries, so that the users transmit at the maximum rate allowed by the capacity constraints (note that $L_2 = L_4 = 0.15$ remains constant). When the battery of every node is depleted, none of the users is allowed to transmit anymore and the system enters an equilibrium state.

We remark that the solution obtained is an NE based on an OL game analysis. Finally, the results shown in Figure 3 have been obtained with a centralized convex optimization algorithm, meaning that it should be run off-line by the system designer, before deploying the real system. Alternatively, we could have used the distributed algorithms proposed in [33], enabling the players to solve the finite horizon approximation of problem (64) in a decentralized manner, even with the coupled capacity constraints.




Fig. 3. Network flow control with $Q = 2$ players, $S = 4$ relay nodes and $A^1 = A^2 = 4$ available paths per node. (Top) Aggregated flow rates at $L_1, \ldots, L_6$. (Middle) Flow for each of the $A = 8$ available paths. (Bottom) Battery level in each of the $S = 4$ relay nodes.


VII. Dynamic Multiple Access Channel:
Nonseparable Utilities

In this section, we consider an uplink scenario in which every user $i \in \mathcal{Q}$ independently chooses its transmitter power, $u_t^i$, aiming to achieve the maximum rate allowed by the channel [18]. If multiple users transmit at the same time, they will interfere with each other, which will decrease their rates, so that they have to find an equilibrium. Let $R_t^i$ denote the rate achieved by user $i$ with normalized noise at time $t$:

$$R_t^i \triangleq \log\left( 1 + \frac{|h^i|^2 u_t^i}{1 + \sum_{j \in \mathcal{Q}: j \ne i} |h^j|^2 u_t^j} \right) \tag{65}$$

where $|h^i|$ denotes the fading channel coefficient of user $i$.


A. Multiple access channel DPG and equivalent MOCP 

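Let $x_t^i \in [0, B_{\max}^i]$ denote the battery level for each player $i \in \mathcal{Q}$, which is discharged proportionally to the transmitted power $u_t^i$. The state of the system is given by the vector with all individual battery levels: $\mathbf{x}_t = (x_t^i)_{i \in \mathcal{Q}}$. Each player is only affected by its own battery, such that $S = Q$, $\mathcal{X}(i) = \{i\}$ and $\mathbf{x}_t^i = x_t^i$. Suppose the agents aim to maximize their transmission rates, while also saving their batteries. This scenario yields the following dynamic game:

$$
\mathcal{G}_5:\ \forall i \in \mathcal{Q} \quad
\begin{aligned}
\underset{\{u_t^i\} \in \prod_{t=0}^{\infty} \mathcal{U}^i}{\text{maximize}} \quad & \sum_{t=0}^{\infty} \beta^t \left( R_t^i + \alpha x_t^i \right) \\
\text{s.t.} \quad & x_{t+1}^i = x_t^i - \delta u_t^i, \quad x_0^i = B_{\max}^i \\
& 0 \le u_t^i \le P_{\max}^i, \quad 0 \le x_t^i \le B_{\max}^i
\end{aligned}
\tag{66}
$$

where $\alpha$ is the weight given for saving the battery, $\delta$ is the discharging factor, and $P_{\max}^i$ and $B_{\max}^i$ denote the maximum transmitter power and maximum available battery level for node $i$, respectively. Problem (66) is a dynamic infinite-horizon extension of the static problem proposed in [?].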

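Instead of looking for a separable structure in the players' utilities, we show that Lemma holds and, hence, problem (66) is a DPG:

$$\frac{\partial^2 \pi^i(x_t^i, \mathbf{u}_t, t)}{\partial x_t^i \, \partial u_t^j} = \frac{\partial^2 \pi^j(x_t^j, \mathbf{u}_t, t)}{\partial x_t^j \, \partial u_t^i} = 0 \tag{67}$$

$$\frac{\partial^2 \pi^i(x_t^i, \mathbf{u}_t, t)}{\partial x_t^i \, \partial x_t^j} = \frac{\partial^2 \pi^j(x_t^j, \mathbf{u}_t, t)}{\partial x_t^j \, \partial x_t^i} = 0 \tag{68}$$

$$\frac{\partial^2 \pi^i(x_t^i, \mathbf{u}_t, t)}{\partial u_t^i \, \partial u_t^j} = \frac{\partial^2 \pi^j(x_t^j, \mathbf{u}_t, t)}{\partial u_t^j \, \partial u_t^i} = - \frac{|h^i|^2 |h^j|^2}{\left( 1 + \sum_{p \in \mathcal{Q}} |h^p|^2 u_t^p \right)^2} \tag{69}$$

In order to find an equivalent MOCP, let us define $\mathcal{X}^i$ and $\mathcal{U}^i$ as open convex sets containing the closed intervals $[0, B_{\max}^i]$ and $[0, P_{\max}^i]$, respectively, so that Assumptions 1-2 hold. Derive the potential function from (23):

$$\Pi(\mathbf{x}_t, \mathbf{u}_t, t) = \log\left( 1 + \sum_{i=1}^{Q} |h^i|^2 u_t^i \right) + \alpha \sum_{i=1}^{Q} x_t^i \tag{70}$$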


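Since $\Pi$ is concave and all equality and inequality constraints in (66) are linear, Assumption 3 is satisfied through Slater's condition. Moreover, since the constraint set is compact and the potential is continuous, Lemma establishes that Assumption 4 holds. Therefore, Theorem 1 states that we can find an NE of (66) by solving the following MOCP:

$$
\mathcal{P}_5: \quad
\begin{aligned}
\underset{\{\mathbf{u}_t\} \in \prod_{t=0}^{\infty} \mathcal{U}}{\text{maximize}} \quad & \sum_{t=0}^{\infty} \beta^t \left( \log\left( 1 + \sum_{i=1}^{Q} |h^i|^2 u_t^i \right) + \alpha \sum_{i=1}^{Q} x_t^i \right) \\
\text{s.t.} \quad & x_{t+1}^i = x_t^i - \delta u_t^i, \quad x_0^i = B_{\max}^i \\
& 0 \le u_t^i \le P_{\max}^i, \quad 0 \le x_t^i \le B_{\max}^i, \quad \forall i \in \mathcal{Q}
\end{aligned}
\tag{71}
$$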
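The relation between the nonseparable utilities of (66) and the potential (70) can be checked numerically: when one user changes its power unilaterally, its per-stage utility and the per-stage potential change by the same amount. The sketch below is only an illustration for a single stage with the battery levels held fixed; the function names, battery levels and power values are hypothetical, while the channel gains and $\alpha$ are those reported for the simulation.

```python
import math


def rate(i, u, h):
    """Rate (65) of user i for powers u and channel coefficients h."""
    interference = sum(h[j] ** 2 * u[j] for j in range(len(u)) if j != i)
    return math.log(1.0 + h[i] ** 2 * u[i] / (1.0 + interference))


def utility(i, x, u, h, alpha):
    """Per-stage utility of user i in game (66)."""
    return rate(i, u, h) + alpha * x[i]


def potential(x, u, h, alpha):
    """Per-stage potential (70)."""
    total = sum(h_i ** 2 * u_i for h_i, u_i in zip(h, u))
    return math.log(1.0 + total) + alpha * sum(x)


h = [2.019, 1.002, 0.514, 0.308]  # channel gains from the simulation
alpha = 0.001
x = [33.0, 20.0, 10.0, 5.0]       # hypothetical battery levels
u = [1.0, 2.0, 0.5, 3.0]          # hypothetical transmitter powers
u_dev = list(u)
u_dev[1] = 4.5                    # the user with index 1 deviates unilaterally

gain_player = utility(1, x, u_dev, h, alpha) - utility(1, x, u, h, alpha)
gain_potential = potential(x, u_dev, h, alpha) - potential(x, u, h, alpha)
```

The two gains coincide because $\log(1 + s/(1 + I)) = \log(1 + I + s) - \log(1 + I)$ and the interference term $I$ does not depend on the deviating user's power.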



B. Simulation results 


Similar to Sec. VI-B, the system reaches an equilibrium state when the batteries have been depleted. Thus, the solution can be approximated by solving a finite horizon problem. Moreover, since the problem is concave, we can use convex optimization software, like [32]. Alternatively, we could solve the KKT system with an efficient ad-hoc distributed algorithm, like in [18].

We simulated a scenario with $Q = 4$ users. We set the maximum battery level $B_{\max}^i = 33$ for all users, the maximum power allowed per user $P_{\max}^i = 6$ for all users, the battery utility weight $\alpha = 0.001$, the transmitter power battery depletion factor $\delta = 1$, and the discount factor $\beta = 0.95$. The channel gains are $|h^1| = 2.019$, $|h^2| = 1.002$, $|h^3| = 0.514$, and $|h^4| = 0.308$.

Figure 4 shows appealing results: the solution of the MOCP (which is an NE of the game) is actually a schedule. In other words, instead of creating interference among users, they wait until the users with higher channel gain have depleted their batteries.



Fig. 4. Dynamic multiple access scenario with $Q = 4$ users. (Top) Sequence of transmitter power chosen by every user. (Bottom) Evolution of the transmission rates. Horizontal axis: time.


VIII. Optimal Scheduling: Nonstationary Problem
with Dynamic Programming Solution

In this section we present the most general form of the proposed framework, and show its applicability to two scheduling problems. First, one of the games has nonseparable utilities, so we have to verify the second order conditions (17)-(19). Second, the equivalent MOCP cannot be approximated by a finite horizon problem, and the utilities are not concave. Thus, we cannot rely upon convex optimization software and we have to use optimal control methods, like dynamic programming [23]. Finally, we consider a nonstationary scenario, in which the channel coefficients evolve with time. This makes the state-transition equations (and the utility for the equal rate problem) depend not only on the current state, but also on time. This problem was introduced in the preliminary paper [?].

A. Proportional fair and equal rate scheduling games and
their equivalent MOCP

Let us redefine the rate achieved by user $i$ at time $t$, so that we consider nonstationary channel coefficients:

$$R_t^i \triangleq \log\left( 1 + \frac{|h_t^i|^2 u_t^i}{1 + \sum_{j \in \mathcal{Q}: j \ne i} |h_t^j|^2 u_t^j} \right) \tag{72}$$

where $u_t^i$ is the transmitter power of player $i$, and $|h_t^i|$ is its time-varying channel coefficient.

We propose two different scheduling games, namely, proportional fair and equal rate scheduling.

1) Proportional fair scheduling: Proportional fair is a compromise-based scheduling algorithm. It aims to maintain a balance between two competing interests: trying to maximize total throughput while, at the same time, guaranteeing a minimal level of service for all users [?].

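In order to achieve this tradeoff, we propose the following game:

$$
\mathcal{G}_6:\ \forall i \in \mathcal{Q} \quad
\begin{aligned}
\underset{\{u_t^i\} \in \prod_{t=0}^{\infty} \mathcal{U}^i}{\text{maximize}} \quad & \sum_{t=0}^{\infty} \beta^t x_t^i \\
\text{s.t.} \quad & x_{t+1}^i = \left( 1 - \frac{1}{t} \right) x_t^i + \frac{R_t^i}{t} \\
& x_0^i = 0, \quad 0 \le u_t^i \le P_{\max}^i
\end{aligned}
\tag{73}
$$

where the state of the system is the vector of all players' average rates $\mathbf{x}_t = (x_t^i)_{i \in \mathcal{Q}}$. Since each player aims to maximize its own average rate, the state-components are unshared among players: $S = Q$ and $\mathcal{X}(i) = \{i\}$.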

In order to show that problem (73) is a DPG, we evaluate Lemma with positive result, and obtain $\Pi$ from (16):

$$\Pi(\mathbf{x}_t, \mathbf{u}_t, t) = \sum_{i=1}^{Q} x_t^i \tag{74}$$

Now, we show that we can derive an equivalent MOCP. It is clear that Assumptions 1-2 hold. By taking the gradient of the constraints of (73) and building a matrix with the gradients of the constraints (i.e., the gradient of each constraint is a column of this matrix), it is straightforward to show that the matrix is full rank. Hence, the linear independence constraint qualification holds (see, e.g., [20, Sec. 3.3.5], [21]), meaning that Assumption 3 is satisfied. Finally, since $P_{\max}^i > 0$ and $\mathbf{x}_0 = 0$, we conclude that there exists some scalar $M > 0$ such that the level set with threshold $M$ is nonempty and bounded, so that Lemma establishes that Assumption 4 is satisfied. Thus, from Theorem 1 we can find an NE of DPG (73) by solving the following MOCP:

$$
\mathcal{P}_6: \quad
\begin{aligned}
\underset{\{\mathbf{u}_t\} \in \prod_{t=0}^{\infty} \mathcal{U}}{\text{maximize}} \quad & \sum_{t=0}^{\infty} \beta^t \sum_{i=1}^{Q} x_t^i \\
\text{s.t.} \quad & x_{t+1}^i = \left( 1 - \frac{1}{t} \right) x_t^i + \frac{R_t^i}{t} \\
& x_0^i = 0, \quad 0 \le u_t^i \le P_{\max}^i
\end{aligned}
\tag{75}
$$


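2) Equal rate scheduling: In this problem, the aim of each user is to maximize its rate, while at the same time keeping the users' cumulative rates as close as possible. Let $x_t^i$ denote the cumulative rate of user $i$. The state of the system is the vector of all users' cumulative rates $\mathbf{x}_t = (x_t^i)_{i \in \mathcal{Q}}$. Again $S = Q$ and $\mathcal{X}(i) = \{i\}$. This problem is modeled by the following game:

$$
\mathcal{G}_7:\ \forall i \in \mathcal{Q} \quad
\begin{aligned}
\underset{\{u_t^i\} \in \prod_{t=0}^{\infty} \mathcal{U}^i}{\text{maximize}} \quad & \sum_{t=0}^{\infty} \beta^t \Big( (1 - \alpha) R_t^i - \alpha \sum_{j \in \mathcal{Q}: j \ne i} \left( x_t^i - x_t^j \right)^2 \Big) \\
\text{s.t.} \quad & x_{t+1}^i = x_t^i + R_t^i \\
& x_0^i = 0, \quad 0 \le u_t^i \le P_{\max}^i
\end{aligned}
\tag{76}
$$

where parameter $\alpha$ weights the contribution of both terms.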


It is easy to verify that conditions (17)-(19) are satisfied. Hence, from Lemma we know that problem (76) is a DPG. In order to obtain an equivalent MOCP, let us define $\mathcal{X}^i$ and $\mathcal{U}^i$ as open convex sets that contain the intervals $[0, \infty)$ and $[0, P_{\max}^i]$, respectively. It follows that Assumptions 1-2 hold. Similar to the proportional fair scheduling problem (73), Assumption 3 holds through the linear independence constraint qualification. Finally, let us check Assumption 4 as follows. Derive the potential $\Pi$ by integrating (23):

$$\Pi(\mathbf{x}_t, \mathbf{u}_t, t) = (1 - \alpha) \log\left( 1 + \sum_{i=1}^{Q} |h_t^i|^2 u_t^i \right) - \alpha \sum_{i=1}^{Q-1} \sum_{j=i+1}^{Q} \left( x_t^i - x_t^j \right)^2 \tag{77}$$

We distinguish two extreme cases: i) all players have exactly the same rate (i.e., $x_t^i = x_t^j$, $i, j = 1, \ldots, Q$); and ii) each player's rate is different from any other player's rate (i.e., $x_t^i \ne x_t^j$, $i \ne j$). When all players have exactly the same rate, the terms $(x_t^i - x_t^j)^2$ vanish for all $(i, j)$ pairs, and (77) only depends on the actions (the state becomes irrelevant). Since the action constraint set is compact, existence of a solution is guaranteed by Lemma. When each player's rate is different from any other player's rate, the term $-(x_t^i - x_t^j)^2$ is coercive, so that $\Pi$ becomes coercive too (since the action constraint set is compact, the term depending on $u_t^i$ is bounded). Thus, existence of an optimal solution is guaranteed by Lemma. Finally, the case where some players' rates are equal and some are different is a combination of the two cases already mentioned, so that the equal terms vanish and the different terms make $\Pi$ coercive. Hence, Theorem 1 states that we can find an NE of DPG (76) by solving the following MOCP:


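$$
\mathcal{P}_7: \quad
\begin{aligned}
\underset{\{\mathbf{u}_t\} \in \prod_{t=0}^{\infty} \mathcal{U}}{\text{maximize}} \quad & \sum_{t=0}^{\infty} \beta^t \Big( (1 - \alpha) \log\Big( 1 + \sum_{i=1}^{Q} |h_t^i|^2 u_t^i \Big) - \alpha \sum_{i=1}^{Q-1} \sum_{j=i+1}^{Q} \left( x_t^i - x_t^j \right)^2 \Big) \\
\text{s.t.} \quad & x_{t+1}^i = x_t^i + R_t^i \\
& x_0^i = 0, \quad 0 \le u_t^i \le P_{\max}^i
\end{aligned}
\tag{78}
$$

B. Solving the MOCP with dynamic programming and simulation results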


Although Lemma establishes existence of an optimal solution to these MOCP, these problems are nonconcave and cannot be approximated by finite horizon problems. Thus, we cannot rely on efficient convex optimization software. In order to numerically solve these problems, we can use dynamic programming methods [23].

Standard dynamic programming methods assume that the MOCP is stationary. One standard method to cope with nonstationary MOCP is to augment the state space so that it includes the time as an extra dimension for some time length $T$. Let the augmented state-vector at time $t$ be denoted by $\tilde{\mathbf{x}}_t = (\mathbf{x}_t, t) \in \tilde{\mathcal{X}} = \mathcal{X} \times \{0, \ldots, T\}$. The state-transition equation in the augmented state space becomes $\tilde{f} : \tilde{\mathcal{X}} \times \mathcal{U} \to \tilde{\mathcal{X}}$. Since we are tackling an infinite horizon problem, when augmenting the state space with the time dimension, it is convenient to impose a periodic time variation:

$$\tilde{f}(\tilde{\mathbf{x}}_t, \mathbf{u}_t) = \begin{bmatrix} f(\mathbf{x}_t, \mathbf{u}_t, t) \\ t + 1 \ (\text{if } t < T) \ \text{or} \ 0 \ (\text{if } t = T) \end{bmatrix} \tag{79}$$

Otherwise, it could be difficult to apply computational dynamic programming methods.
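The periodic augmentation (79) amounts to wrapping a time-varying transition into a stationary one that also advances a cyclic time index. A minimal sketch follows; the wrapper name and the toy dynamics `toy_f` are hypothetical and serve only to show the wrap-around after $T$.

```python
def make_augmented_transition(f, T):
    """Wrap a time-varying transition f(x, u, t) into the stationary map (79)
    acting on augmented states (x, t), with the time index wrapping after T."""
    def f_aug(x_aug, u):
        x, t = x_aug
        t_next = t + 1 if t < T else 0
        return (f(x, u, t), t_next)
    return f_aug


# Hypothetical scalar dynamics whose gain depends on time with period T + 1.
def toy_f(x, u, t):
    return x + (1 + t % 3) * u


f_aug = make_augmented_transition(toy_f, T=2)
state = (0.0, 0)
trajectory = [state]
for _ in range(4):
    state = f_aug(state, 1.0)
    trajectory.append(state)
```

In this toy run the time component of the augmented state cycles through 0, 1, 2, 0, 1, so the same augmented transition can be reused at every stage of the infinite horizon problem.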

One further difficulty for solving MOCP with continuous state and action spaces is that dynamic programming methods are mainly derived for discrete state-action spaces. Two common approaches to overcome this limitation are i) to use a parametric approximation of the value function (e.g., a neural network that takes the continuous state-action variables as inputs and is trained by minimizing the error in the Bellman equation); or ii) to discretize the continuous spaces, so that the value function is approximated on a set of points. For simplicity, we follow the discretization approach here. We remark that it may be problematic to finely discretize the state-action spaces in high-dimensional problems though, since the computational load increases exponentially with the number of states. These and other approximation techniques, usually known as approximate dynamic programming, are still an active area of research (see, e.g., [23, Ch. 6], [38]).

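Introduce the optimal value function for the augmented set:

$$
\begin{aligned}
V^\star(\tilde{\mathbf{x}}_0) &= \max_{\{\mathbf{u}_t\} \in \prod_{t=0}^{\infty} \mathcal{U}} \sum_{t=0}^{\infty} \beta^t \Pi(\mathbf{x}_t, \mathbf{u}_t, t) \\
&= \sum_{t=0}^{\infty} \beta^t \Pi(\mathbf{x}_t, \phi^\star(\mathbf{x}_t, t), t) \\
&= \sum_{t=0}^{\infty} \beta^t \Pi(\mathbf{x}_t, \mathbf{u}_t^\star, t)
\end{aligned}
\tag{80}
$$

where $\phi^\star : \tilde{\mathcal{X}} \to \mathcal{U}$ is the optimal policy that provides the sequence of actions $\{\mathbf{u}_t^\star = \phi^\star(\tilde{\mathbf{x}}_t)\}_{t=0}^{\infty}$ that is the solution to the MOCP, as explained by Lemma. Then, the Bellman optimality equation is given by

$$V^\star(\tilde{\mathbf{x}}_t) = \Pi(\tilde{\mathbf{x}}_t, \mathbf{u}_t^\star) + \beta V^\star\big(\tilde{f}(\tilde{\mathbf{x}}_t, \mathbf{u}_t^\star)\big) \tag{81}$$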

Among the available dynamic programming methods, we choose value iteration (VI) for its reduced complexity per iteration with respect to policy iteration (PI), which is especially relevant when the state-grid has fine resolution (i.e., a large number of states). VI is obtained by turning the Bellman optimality equation (81) into an update rule, so that it generates a sequence of value functions $V_k$ that converge to the optimal value (i.e., $\lim_{k \to \infty} V_k = V^\star$, where $V_0$ is arbitrary). In particular, at every iteration $k$, we obtain the policy $\phi$ that maximizes $V_k$ (policy improvement). Then, we update the value function for the latest policy (policy evaluation).

VI is summarized in Algorithm 1, where the operator $[\mathbf{x}]$ denotes the closest point to $\mathbf{x}$ in the discrete grid.


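Algorithm 1: Value Iteration for the non-stationary MOCP

Inputs: number of states $S$, threshold $\epsilon$
Discretize the augmented space $\tilde{\mathcal{X}}$ into a grid of $S$ states
Initialize $\Delta = \infty$, $k = 0$ and $V_0(\tilde{\mathbf{x}}_s) = 0$ for $s = 1, \ldots, S$
while $\Delta > \epsilon$
    for every state $s = 1$ to $S$ do
        $\tilde{\mathbf{x}}_s \leftarrow$ the $s$-th point on the grid
        $\phi(\tilde{\mathbf{x}}_s) = \arg\max_{\mathbf{u}} \ \Pi(\tilde{\mathbf{x}}_s, \mathbf{u}) + \beta V_k([\tilde{f}(\tilde{\mathbf{x}}_s, \mathbf{u})])$
        $V_{k+1}(\tilde{\mathbf{x}}_s) = \Pi(\tilde{\mathbf{x}}_s, \phi(\tilde{\mathbf{x}}_s)) + \beta V_k([\tilde{f}(\tilde{\mathbf{x}}_s, \phi(\tilde{\mathbf{x}}_s))])$
    end for
    $\Delta = \max_s |V_{k+1}(\tilde{\mathbf{x}}_s) - V_k(\tilde{\mathbf{x}}_s)|$
    $k = k + 1$
end while
Return: $\phi(\tilde{\mathbf{x}}_s)$ and $V_k(\tilde{\mathbf{x}}_s)$ for $s = 1, \ldots, S$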


Note that the output of the value iteration algorithm is a policy (i.e., a function), rather than a sequence of actions. This result allows us to compute the optimal actions of every player from the current state at every time-step of the game. When there is no reason to propose a finite-horizon approximation of the game, a policy is a more practical representation of the solution than an infinite sequence of actions.
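The value iteration scheme described above can be sketched on a small grid as follows. This is a minimal illustration rather than the implementation used for the experiments: the function signature is an assumption, and the toy instance (a battery with integer levels, an alternating channel gain and on/off transmission) is hypothetical and much simpler than the scheduling games.

```python
def value_iteration(states, actions, reward, transition, project, beta, eps):
    """Value iteration over a finite grid of augmented states.

    reward(s, u) is the per-stage potential, transition(s, u) the augmented
    transition, and project(s) maps a state to the closest grid point.
    """
    values = {s: 0.0 for s in states}
    while True:
        new_values, policy = {}, {}
        for s in states:
            best_u = max(
                actions,
                key=lambda u: reward(s, u) + beta * values[project(transition(s, u))],
            )
            policy[s] = best_u
            new_values[s] = reward(s, best_u) + beta * values[project(transition(s, best_u))]
        delta = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if delta <= eps:
            return policy, values


# Toy instance (all numbers hypothetical): battery levels 0..3, a channel gain
# that alternates with period 2, and a binary decision to transmit one unit.
T = 1
gains = [1.0, 2.0]
states = [(x, t) for x in range(4) for t in range(T + 1)]
actions = [0, 1]


def reward(s, u):
    x, t = s
    return gains[t] * min(u, x)  # nothing is sent from an empty battery


def transition(s, u):
    x, t = s
    return (x - min(u, x), t + 1 if t < T else 0)


def project(s):
    x, t = s
    return (min(max(round(x), 0), 3), t)  # closest grid point


policy, values = value_iteration(states, actions, reward, transition, project, 0.9, 1e-9)
```

In this toy problem the resulting policy waits when the gain is low and transmits when it is high, which mirrors the scheduling behaviour that a policy (rather than a fixed action sequence) can express for any current state.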

We simulate a simple scenario with $Q = 2$ users. The channel coefficients are sinusoids with different frequency and different amplitude for each user (see Fig. 5). The maximum transmitter power is $P_{\max}^1 = P_{\max}^2 = 5$, with 20 possible power levels per user, which amounts to 400 possible actions. We discretize the state-space (i.e., the users' rates) into a grid of 30 points per user. The nonstationarity of the environment is surmounted by augmenting the state-space with $T = 20$ time steps. Hence, the augmented state space has a total of $30^2 \times 20 = 18{,}000$ states. For the equal-rate problem, the utility function uses $\alpha = 0.9$.
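The sizes quoted above follow from simple counting, which also gives a rough idea of the work per sweep of value iteration (one evaluation per state-action pair). The variable names in this short sketch are illustrative.

```python
users = 2
power_levels = 20   # possible transmitter power levels per user
rate_points = 30    # grid points per user's rate
time_steps = 20     # T, length of the periodic window

n_actions = power_levels ** users               # joint actions
n_states = rate_points ** users * time_steps    # augmented states
evaluations_per_sweep = n_states * n_actions    # state-action pairs per VI sweep
```

Both counts grow exponentially with the number of users, which is the discretization cost mentioned earlier for high-dimensional problems.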

The solution of the proportional fair game leads to an efficient scheduler (see Figure 6), in which both users try to minimize interference so that they approach their respective maximum rates.

For the equal rate problem, we observe that the agents achieve much lower rates, but very similar to each other (see Figure 7). The trend is that the user with the lower channel gain (User 2, red-dashed line) tries to achieve its maximum rate, while the user with the higher channel gain (User 1, blue-continuous line) reduces its transmitter power to match the rate of the other user. In other words, the user with the poorest channel sets a bottleneck for the other user.

Fig. 5. Periodic time variation of the channel coefficients for $Q = 2$ users. All possible combinations of coefficients are included in a window of $T = 20$ time steps. Horizontal axis: time.

Fig. 6. Proportional fair scheduling problem for $Q = 2$ users. (Top) Transmitter power $u_t^i$. (Bottom) Average rate $x_t^i$ given by (73). Both users achieve near maximum average rates for their channel coefficients. Horizontal axis: time.

Fig. 7. Equal rate problem for $Q = 2$ users. (Top) Transmitter power $u_t^i$. (Bottom) Average rate $x_t^i / t$ (recall that $x_t^i$ given by (76) denotes accumulated rate). User 1 reduces its average rate to match that of User 2, regardless of having a higher channel coefficient. Horizontal axis: time.


Finally, note that Algorithm 1 is centralized, so the 
results displayed in Figures 6 and 7 have been obtained assuming 
the existence of a central unit that knows the channel 
coefficients, transmission power and average rate of all users, 
so that it can update the value and policy functions for all 
states. We remark that the design and analysis of distributed 
dynamic programming algorithms when multiple players share 
state-vector components and/or have coupled constraints is a 
nontrivial task. Nevertheless, when the players share no 
state-vector components and have uncoupled constraints, there 
are distributed implementations of VI and PI that converge 
to the optimal solution [39]-[42]. This is indeed the case for 
problems (75) and (78), where each agent i has a unique 
state-vector component $x_t^i$ and the constraints are uncoupled. 
Therefore, the agents could solve these problems in a decentralized 
manner. 
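As an illustration of the centralized computation discussed above, the following is a minimal value-iteration sketch on a time-augmented finite problem. It is not Algorithm 1 itself: the array layout (`reward[t, s, a]`, `next_state[t, s, a]`), the deterministic transitions, the function name, and the default discount factor are all assumptions made for the example. Infeasible actions could be handled by assigning them a reward of `-np.inf`.

```python
import numpy as np


def value_iteration_periodic(reward, next_state, beta=0.9, tol=1e-8,
                             max_iter=10_000):
    """Value iteration for a periodic, deterministic finite problem.

    reward[t, s, a]     : utility of action a in state s at time phase t
    next_state[t, s, a] : integer index of the successor state
    The time phase advances as (t + 1) mod T, so the augmented problem
    is stationary. Returns the value table V[t, s] and policy[t, s].
    """
    T, S, A = reward.shape
    t_idx = np.arange(T)[:, None, None]
    V = np.zeros((T, S))
    for _ in range(max_iter):
        # V_next[t] holds the value table of the following time phase.
        V_next = np.roll(V, -1, axis=0)
        # Q[t, s, a] = reward + discounted value of the successor state.
        Q = reward + beta * V_next[t_idx, next_state]
        V_new = Q.max(axis=2)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    policy = Q.argmax(axis=2)
    return V, policy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, S, A = 4, 5, 3  # toy sizes, far smaller than the 18,000-state case
    reward = rng.random((T, S, A))
    next_state = rng.integers(0, S, size=(T, S, A))
    V, policy = value_iteration_periodic(reward, next_state)
    print(V.shape, policy.shape)
```

Every sweep reads the full value table, which is why a central unit with knowledge of all state components is needed in this form.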


IX. Conclusions 

DPGs provide a useful framework for competitive multiagent 
applications in time-varying environments. On the one 
hand, DPGs accommodate nonstationary scenarios and thus more 
realistic models. On the other hand, the analysis and solution of 
a DPG is tractable through an equivalent MOCP. We presented a 
complete description of DPGs and provided conditions for a 
dynamic game with constrained state and action sets to be 
of the potential type. To the best of our knowledge, previous 
works have not dealt explicitly with DPGs with constraints. 

We also introduced a range of communication and networking 
examples: energy demand control in a smart-grid network, 
network flow with relays that have bounded link capacity and 
limited battery life, multiple access communication in which 
users have to optimize the use of their batteries, and two 
optimal scheduling games with nonstationary channels. 
Although these problems have different features (including 
utilities in separable and nonseparable form, convex and 
nonconvex objectives, closed-form and numerical solutions, and 
solution methods based on convex optimization and dynamic 
programming algorithms), the proposed framework allowed us 
to analyze and solve them in a unified manner. 

The DPG framework is promising in the sense that, once 
the equivalent MOCP has been formulated, it is possible to use 
ideas from optimal control theory to extend the current analysis. 
In particular, we have assumed that the agents can observe 
all the variables that influence their objective functions and 
constraints. This is known as perfect information. Although 
perfect information is reasonable in many applications, there 
are problems in which not all the information is available to 
all agents. An example of a game with imperfect information 
is one in which the agents cannot directly observe the variables 
that influence their objectives and constraints; rather, they only 
have access to another variable that depends on the state. 
The current framework could possibly be extended to this 
case by using a partially observable Markov decision process 
(POMDP) formulation [43], [44]. Nevertheless, other forms 
of imperfect information, such as when the agents cannot see 
other players' actions, would require further study. Another 
possible direction to extend the current analysis is to allow 
stochastic state transitions and utilities (i.e., considering $f$ 
and $\pi^i$ as random variables). This can be done by applying the 
Euler equation to the stochastic Lagrangian in order to derive 
a set of stochastic optimality conditions. Finally, we could 
also consider the case where the agents know nothing about 
the problem; rather, they have to learn the optimal policy 
from trial-and-error experimentation. To this end, we could 
apply reinforcement learning (RL) and approximate dynamic 
programming (ADP) techniques (such as Q-learning) [45]-
[47, Ch. 6]. The main difficulty with standard ADP/RL 
techniques is that they have been developed for unconstrained 
MOCP, and some adaptation of these techniques is necessary. 
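To make the kind of adaptation mentioned above concrete, the sketch below shows tabular Q-learning in which both exploration and the bootstrap target are restricted to a state-dependent feasible action set. This is only one simple possibility, assumed here for illustration; it is not a method proposed in this paper, and it does not address coupled or state constraints. The names `masked_q_learning`, `step`, and `feasible`, as well as all hyperparameters and the toy environment, are hypothetical.

```python
import numpy as np


def masked_q_learning(step, feasible, num_states, num_actions,
                      episodes=500, horizon=50, beta=0.9, lr=0.1,
                      eps=0.1, seed=0):
    """Tabular Q-learning restricted to feasible actions.

    step(s, a) -> (s_next, reward) is the (unknown to the agent) model.
    feasible[s, a] is True when action a is allowed in state s; every
    state is assumed to have at least one feasible action.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = int(rng.integers(num_states))
        for _ in range(horizon):
            allowed = np.flatnonzero(feasible[s])
            if rng.random() < eps:
                # Explore, but only among feasible actions.
                a = int(rng.choice(allowed))
            else:
                # Greedy choice among feasible actions.
                a = int(allowed[np.argmax(Q[s, allowed])])
            s_next, r = step(s, a)
            allowed_next = np.flatnonzero(feasible[s_next])
            # Bootstrap only from actions feasible in the next state.
            target = r + beta * Q[s_next, allowed_next].max()
            Q[s, a] += lr * (target - Q[s, a])
            s = s_next
    return Q


if __name__ == "__main__":
    # Toy chain: action 1 moves right, action 0 stays; the last state
    # forbids moving right and pays a reward of 1 per step.
    def step(s, a):
        s_next = min(s + a, 2)
        return s_next, 1.0 if s_next == 2 else 0.0

    feasible = np.array([[True, True], [True, True], [True, False]])
    print(masked_q_learning(step, feasible, 3, 2))
```

Entries of the table that correspond to infeasible state-action pairs are never updated, so the learned greedy policy respects the constraint by construction.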